Cross-validation done right

nyuton · April 20, 2021, 1:28pm

Hi,

I’ve recently read Marcos Lopez de Prado’s great book, “Advances in Financial Machine Learning”.
I’ve learnt quite a few things and I would like to share some improvements I made.

Doing cross-validation properly is something that greatly affected my model selection process and improved my confidence in my models.

He suggests making the best use of the data we have by doing cross-validation the following way:

Split your dataset into N splits (6 in this example)
Take all possible combinations of k splits as the validation set (k=2 in this example)
Use the rest of the data as the training set.

By splitting the data into N=6 splits and using k=2 splits as the validation set, you end up with C(6,2) = 15 valid combinations of train/validation sets.
He argues that when the data in the splits are independent and non-overlapping, validating on data that precedes the training data is a valid process. In our case the eras in the Numerai training data are non-overlapping.
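
For illustration, the enumeration of train/validation combinations can be sketched like this. It is a minimal sketch, not the code linked in this post: the function name `combinatorial_splits` and the group indices 0..5 are hypothetical, and each group is assumed to be a contiguous block of eras.

```python
from itertools import combinations

def combinatorial_splits(n_groups=6, k_val=2):
    """Yield (train_groups, val_groups) for every choice of k_val validation groups.

    Group indices are hypothetical placeholders for blocks of eras.
    """
    groups = list(range(n_groups))
    for val in combinations(groups, k_val):
        # Everything not chosen for validation is used for training.
        train = tuple(g for g in groups if g not in val)
        yield train, val

splits = list(combinatorial_splits(6, 2))
print(len(splits))  # 15
```

Each group ends up in the validation set in 5 of the 15 combinations, so every part of the data gets scored several times against different training sets.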

You can download and tweak my code here:

Yes, it takes a lot of time to train these models, but at least you can trust the results.
Have fun!
Feedback is welcome!

schot · April 20, 2021, 2:43pm

It’s common practice in time series analysis to build lagged features, such as the price of a security a month ago. I don’t know how long it should be, but I guess it’s safer to have some gap between the training and validation sets.

I don’t have any hard evidence that the eras in the dataset overlap, though. It’s just my speculation.

nyuton · April 20, 2021, 3:38pm

The code I shared has the parameter “embargo”. That’s the minimum gap between train and validation.
I’ve tried a couple of values, but it doesn’t have any significant effect. Seems like the eras are not overlapping.
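
A minimal sketch of what such a gap could look like, assuming the embargo is counted in eras and that training eras within that distance of any validation era are dropped on both sides. That is an assumption made for illustration; the shared code may define the parameter differently, and `apply_embargo` is a hypothetical name.

```python
def apply_embargo(train_eras, val_eras, embargo):
    """Drop training eras that lie within `embargo` eras of any validation era.

    Assumption: eras are integers in time order and the gap applies on both sides.
    """
    return [
        era for era in train_eras
        if all(abs(era - v) > embargo for v in val_eras)
    ]

# Toy example: eras 1..20 with eras 8..10 held out for validation.
val_eras = [8, 9, 10]
train_eras = [e for e in range(1, 21) if e not in val_eras]
kept = apply_embargo(train_eras, val_eras, embargo=2)
print(kept)  # eras 6, 7, 11 and 12 are removed from training
```

With embargo=0 the training set is unchanged, which makes it easy to compare scores with and without the gap.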

jackerparker · April 21, 2021, 8:04am

Hi nyuton,

Your groups (G) are divided using the following eras: 1-30 (G1), 31-60 (G2), 61-90 (G3), 91-120 (G4), 121-132 (G5) and 197-212 (G6).
What do you think about having the same number of eras in every group? 1-25 (G1), 26-50 (G2) …

Or, going further, having the same number of rows in every group? The early eras have fewer rows, so groups G1-G2 could contain more eras than groups G3-G4, while the number of rows stays close between groups.
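
Something like the following sketch could build such row-balanced groups. The function name `balanced_era_groups` is hypothetical and the row counts are made-up toy numbers, not the real Numerai era sizes; eras are kept contiguous and in time order.

```python
def balanced_era_groups(rows_per_era, n_groups):
    """Assign time-ordered eras to contiguous groups with similar row counts.

    rows_per_era: list of (era, n_rows) tuples in time order.
    Each era goes to the group whose share of the cumulative row count
    contains the era's midpoint.
    """
    total = sum(n for _, n in rows_per_era)
    groups = [[] for _ in range(n_groups)]
    cum = 0
    for era, n in rows_per_era:
        idx = min(int((cum + n / 2) * n_groups / total), n_groups - 1)
        groups[idx].append(era)
        cum += n
    return groups

# Toy data: early eras are smaller than later ones.
toy_rows = [(1, 100), (2, 100), (3, 200), (4, 200), (5, 200), (6, 400)]
groups = balanced_era_groups(toy_rows, n_groups=3)
print(groups)  # [[1, 2, 3], [4, 5], [6]]
```

In this toy case every group holds 400 rows, even though the number of eras per group differs.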

Regards,
Mark

nyuton · May 2, 2021, 8:56am

Hi JackerParker,

My splits are somewhat arbitrary, you are right. I wanted to split the validation eras into different splits so that I have a score on the Numerai validation set. That’s why I chose these splits.

More groups would give better granularity, but this is already too time-consuming. CV with random forests can run for hours on my computer.
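
To give a rough idea of the cost, here is a small sketch of how the number of model fits grows with the number of groups, assuming k=2 validation groups as in the example above. The helper name `n_fits` is hypothetical.

```python
from math import comb

def n_fits(n_groups, k_val=2):
    """Number of train/validation combinations, i.e. models to train."""
    return comb(n_groups, k_val)

fits = {n: n_fits(n) for n in (6, 8, 10, 12)}
print(fits)  # {6: 15, 8: 28, 10: 45, 12: 66}
```

Doubling the groups from 6 to 12 more than quadruples the number of fits, so the run time climbs quickly.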

Feel free to tweak it!
I wanted to share the idea and the base code, but there is certainly some room for improvement!
