Cross-validation done right


I’ve recently read Marcos Lopez de Prado’s great book, “Advances in Financial Machine Learning”.
I’ve learnt quite a few things and I would like to share some improvements I made.

Doing cross-validation properly is something that greatly affected my model selection process and improved my confidence in my models.

He suggests making the best use of the data we have by doing cross-validation the following way:

  1. Split your dataset into N splits (6 in this example)
  2. Take all possible combinations of k splits as validation set (k=2 in this example)
  3. Use the rest of the data as training set.

By splitting the data into N=6 splits and using k=2 splits as the validation set, you end up with C(6,2) = 15 valid combinations of train/validation sets.
He argues that when the data in the splits are independent and non-overlapping, validating on data that precedes the training data is a valid process. In our case the eras in the numerai training data are non-overlapping.
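
The three steps above can be sketched roughly like this. It is not the code I shared, only a minimal illustration: the function name `combinatorial_splits` is made up for this example, and it assumes the groups are contiguous, roughly equal-sized chunks of a sorted list of eras.

```python
from itertools import combinations

import numpy as np


def combinatorial_splits(eras, n_groups=6, k=2):
    """Yield (train_eras, validation_eras) for every choice of k validation groups.

    Assumption for this sketch: `eras` is sorted and is cut into
    contiguous, roughly equal-sized groups.
    """
    groups = np.array_split(np.asarray(eras), n_groups)
    for val_idx in combinations(range(n_groups), k):
        # The k chosen groups form the validation set...
        val = np.concatenate([groups[i] for i in val_idx])
        # ...and all remaining groups form the training set.
        train = np.concatenate(
            [groups[i] for i in range(n_groups) if i not in val_idx]
        )
        yield train, val


# Hypothetical example: 120 eras, 6 groups, 2 validation groups per split.
splits = list(combinatorial_splits(range(1, 121), n_groups=6, k=2))
print(len(splits))  # 15, i.e. C(6, 2)
```

Each group ends up in the validation set in 5 of the 15 combinations, so every era gets scored several times by models that never saw it.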

You can download and tweak my code here:

Yes, it takes a lot of time to train these models, but at least you can trust the results.
Have fun!
Feedback is welcome!


It’s common practice in time series analysis to build lagged features, such as the price of a security a month ago. I don’t know how far back such lags would go here, but I guess it’s safer to leave some gap between the training and validation sets.

I don’t have any hard evidence that the dataset is overlapping in terms of eras though. It’s just my speculation.

The code I shared has a parameter called “embargo”. That’s the minimum gap between train and validation.
I’ve tried a couple of values, but none of them had any significant effect. It seems the eras are not overlapping.
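
To make the idea of the gap concrete, here is a small sketch. It is an assumption about how such a gap can be applied, not a copy of the shared code: the helper `apply_embargo` is hypothetical, eras are assumed to be integers, and training eras are dropped on both sides of the validation eras.

```python
import numpy as np


def apply_embargo(train_eras, val_eras, embargo=0):
    """Drop training eras that lie within `embargo` eras of any validation era.

    Hypothetical helper: assumes integer era numbers and a symmetric gap.
    """
    train_eras = np.asarray(train_eras)
    val_eras = np.asarray(val_eras)
    if embargo <= 0 or len(val_eras) == 0:
        return train_eras
    # Distance from each training era to its nearest validation era.
    dist = np.abs(train_eras[:, None] - val_eras[None, :]).min(axis=1)
    return train_eras[dist > embargo]


# Made-up example: eras 9-12 are validation, the rest of 1-20 is training.
train = [e for e in range(1, 21) if not 9 <= e <= 12]
val = range(9, 13)
kept = apply_embargo(train, val, embargo=2)
print(kept)  # eras 7, 8, 13 and 14 are removed from training
```

With `embargo=0` the training set is returned unchanged, which makes it easy to compare scores with and without the gap.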


Hi nyuton,

Your groups (G) are divided using the following eras: 1-30 (G1), 31-60, 61-90, 91-120, 121-132 and 197-212 (G6).
What do you think about having the same number of eras in every group? 1-25 (G1), 26-50 (G2) …

Or, going even further, having the same number of rows in every group? The first eras have fewer rows, so groups G1-G2 could contain more eras than groups G3-G4, while the number of rows would be close enough between groups.
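
A rough sketch of what I mean by balancing on rows, assuming we know the number of rows per era. The function name `groups_by_row_count` and the row counts in the example are made up for illustration; this is just one simple way to do it.

```python
import numpy as np


def groups_by_row_count(rows_per_era, n_groups):
    """Assign each era, in order, to one of n_groups contiguous groups
    so that the groups hold a similar total number of rows."""
    rows = np.asarray(rows_per_era, dtype=float)
    cum = np.cumsum(rows)
    # Midpoint of each era in the cumulative row count, scaled to [0, n_groups).
    mid = (cum - rows / 2) / cum[-1] * n_groups
    return np.minimum(mid.astype(int), n_groups - 1)


# Made-up row counts: early eras are smaller than later ones.
labels = groups_by_row_count([1, 1, 2, 2, 3, 3], n_groups=3)
print(labels)  # group index per era
```

The early group then spans more eras than the later ones, and the split points depend only on cumulative row counts, so the groups stay contiguous in time.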


Hi JackerParker,

My splits are somewhat arbitrary, you are right. I wanted to put the validation eras into separate splits so that I get a score on the numerai validation set. That’s why I chose these splits.

More groups would give better granularity, but this is already too time-consuming. CV with random forests can run for hours on my computer.

Feel free to tweak it!
I wanted to share the idea and the base code, but there is certainly some room for improvement!