I’ve recently read Marcos Lopez de Prado’s great book “Advances in Financial Machine Learning”.
I’ve learnt quite a few things and I would like to share some improvements I made.
Doing cross-validation properly is something that greatly affected my model selection process and improved my confidence in my models.
He suggests making the best use of the data we have by doing cross-validation the following way:
- Split your dataset into N splits (6 in this example)
- Take all possible combinations of k splits as validation set (k=2 in this example)
- Use the rest of the data as training set.
By splitting the data into N=6 splits and using k=2 of them as the validation set, you end up with "6 choose 2" = 6!/(2!·4!) = 15 valid combinations of train/validation sets.
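To make the counting concrete, here is a minimal sketch of the splitting scheme in plain Python. It is not the code from the book or from my notebook: the function name `combinatorial_splits` and its parameters are made up for illustration, and it assumes the rows are already in order so that contiguous blocks make sense.

```python
from itertools import combinations


def combinatorial_splits(n_samples, n_splits=6, n_val_splits=2):
    """Yield (train_idx, val_idx) for every choice of validation splits.

    Hypothetical helper: splits are contiguous, near-equal blocks of row indices.
    """
    # Cut range(n_samples) into n_splits contiguous blocks.
    base, extra = divmod(n_samples, n_splits)
    groups, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size

    # Every combination of n_val_splits blocks becomes one validation set;
    # the remaining blocks form the training set.
    for val_groups in combinations(range(n_splits), n_val_splits):
        val_idx = [j for g in val_groups for j in groups[g]]
        train_idx = [
            j for g in range(n_splits) if g not in val_groups for j in groups[g]
        ]
        yield train_idx, val_idx


splits = list(combinatorial_splits(n_samples=120, n_splits=6, n_val_splits=2))
print(len(splits))  # 15
```

Each pair can then be fed to whatever model you are fitting, so one model configuration gets trained and scored 15 times.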
He argues that when the data in the splits are independent and non-overlapping, validating on data that precedes the training data is a valid process. In our case, the eras in the Numerai training data are non-overlapping.
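Since the eras are the non-overlapping unit, it makes sense to cut the data along era boundaries rather than at arbitrary rows. The sketch below shows one way to do that, under some assumptions: `era_splits` is a hypothetical name, the era labels are integers that sort chronologically (string labels such as "era10" would need parsing first), and the toy data is invented purely for the example.

```python
from itertools import combinations


def era_splits(eras, n_splits=6, n_val_splits=2):
    """Yield (train_rows, val_rows) so that no era is shared between the two.

    `eras` holds one era label per row. Hypothetical helper for illustration.
    """
    unique_eras = sorted(set(eras))  # assumes labels sort chronologically

    # Assign consecutive eras to n_splits near-equal groups.
    base, extra = divmod(len(unique_eras), n_splits)
    era_to_group, start = {}, 0
    for g in range(n_splits):
        size = base + (1 if g < extra else 0)
        for era in unique_eras[start:start + size]:
            era_to_group[era] = g
        start += size

    row_groups = [era_to_group[e] for e in eras]
    for val_groups in combinations(range(n_splits), n_val_splits):
        val_rows = [i for i, g in enumerate(row_groups) if g in val_groups]
        train_rows = [i for i, g in enumerate(row_groups) if g not in val_groups]
        yield train_rows, val_rows


# Toy data (made up): 12 eras with 5 rows each.
eras = [era for era in range(1, 13) for _ in range(5)]
era_folds = list(era_splits(eras))
```

Note that in some of the 15 combinations the validation eras come before all of the training eras, which is exactly the case the independence argument above is meant to justify.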
You can download and tweak my code here:
Yes, it takes a lot of time to train these models, but at least you can trust the results.
Feedback is welcome!