Time series CV & separation to live data

kayeffnumeraitor · November 10, 2022, 7:18am

While training one of my models with expanding-window time series cross-validation, a random thought popped into my head:

Let's say I train a model on the first 100 eras, and test/select it during training on eras 110-120 to prevent any leakage. In the next fold I will train on the first 120 eras, test on eras 130-140, and so on. The furthest I can push the last fold is right up to live era - 6, meaning right now I would train on the first 1010 eras and test on 1020-1030.

However, if I use the model from the last fold to do the live predictions, the time separation from the train set is not 10 eras but 30 + 6 eras. So the performance numbers from the test sets are measured with a different separation window than the one that applies to the live data. But obviously if I increase the time separation on the test data during training, this will only add to the separation from the live data.

How can I get around this? Is this even an issue? Until now I have always used the model from the last fold.

nyuton · November 10, 2022, 8:28am

The data in the training set contains overlapping rounds!!!

When you train on the first 100 eras, you shouldn’t use eras 101-104 for testing, because in real life they are not available.

You should use purged time series cross validation.
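As a rough illustration of what a purged (gapped) expanding split could look like at the era level, here is a minimal sketch. The function name `expanding_era_folds` and its parameters are made up for this example, and the 10-era gap and inclusive test range are taken from the numbers in the question rather than from any official recommendation. The right gap depends on how far the targets overlap.

```python
from typing import Iterator, Tuple


def expanding_era_folds(
    last_era: int,
    first_train_end: int = 100,  # assumption: first fold trains on eras 1..100
    embargo: int = 10,           # assumption: test starts 10 eras after train ends
    test_len: int = 10,          # test covers test_start..test_start+10 inclusive
) -> Iterator[Tuple[range, range]]:
    """Yield (train_eras, test_eras) for an expanding window with a purge gap."""
    train_end = first_train_end
    while train_end + embargo + test_len <= last_era:
        test_start = train_end + embargo
        test_end = test_start + test_len  # inclusive, as in "eras 110-120"
        # Eras strictly between train_end and test_start are dropped (purged).
        yield range(1, train_end + 1), range(test_start, test_end + 1)
        # The next fold trains on everything up to the previous test end.
        train_end = test_end


folds = list(expanding_era_folds(last_era=1030))
for train_eras, test_eras in folds[:2]:
    print(f"train 1-{train_eras[-1]}, test {test_eras[0]}-{test_eras[-1]}")
```

Each fold keeps the same distance between the last training era and the first test era, so the scores from different folds are at least comparable with each other.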

sirbradflies · November 10, 2022, 8:42am

Hi,
You should see CV and training as two separate steps. You may want to use CV (for example through cross_val_score in sklearn) to get a sense of the model's future performance (as long as you don’t use the CV data for model tuning), but then you can train the model on the full dataset so you don’t “waste” any data.
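A minimal sketch of that two-step idea, on synthetic data: the `Ridge` model, the feature matrix and the `gap=10` value are all placeholders chosen for the example. Note that `TimeSeriesSplit` counts its gap in rows, so this only maps to eras if you have one row per era or build era-level splits yourself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data (assumption: one row per time step).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=0.1, size=500)

# Step 1: CV is used only to estimate out-of-sample performance.
cv = TimeSeriesSplit(n_splits=5, gap=10)  # expanding train window, 10-row gap
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print("CV r2 per fold:", np.round(scores, 3))

# Step 2: the model used for live predictions is refit on all available rows.
final_model = Ridge(alpha=1.0).fit(X, y)
```

The CV scores describe how models of this kind did on held-out later data. The final model itself is never scored, because there is no data left to score it on.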

Hope it helps!

kayeffnumeraitor · November 10, 2022, 11:05am

Yes, that is why I said I train on first 100 eras, and will test on eras 110 - 120, so a 10 era gap between train and test to avoid leakage.

kayeffnumeraitor · November 10, 2022, 11:47am

I guess my problem boils down to this misunderstanding. I will read up on CV again.

jmrichardson · November 13, 2022, 5:04pm

You don’t have to use the last model of the fold. You can just train a new model on the latest data (obviously without a test set), which would match your expanding-window CV strategy. One thing I have found helpful is to look not just at a short test window (10 eras in your example) but at all the available data as you walk forward. It gives you a better sense of how the model performs as market regimes change over time. Expanding does appear to perform better than a fixed window. Here’s an example of a simple model I was testing, where you see the mean and Sharpe for the entire test data set but also for the first 20 and last 20 as you walk forward:
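For anyone who wants to produce a similar summary, here is a small sketch (not the code behind the example above). It assumes you already have one score per walk-forward test era. The scores below are random placeholders, `summarize` is a made-up helper, and Sharpe is taken here as the plain mean divided by the sample standard deviation of per-era scores, with no annualization.

```python
import random
from statistics import mean, stdev


def summarize(era_scores):
    """Return (mean, sharpe) of per-era scores; sharpe = mean / sample std."""
    m = mean(era_scores)
    s = stdev(era_scores)
    return m, (m / s if s > 0 else float("nan"))


random.seed(0)
# Hypothetical per-era scores collected while walking forward (placeholder data).
scores = [random.gauss(0.02, 0.03) for _ in range(200)]

report = {
    "all": summarize(scores),
    "first_20": summarize(scores[:20]),
    "last_20": summarize(scores[-20:]),
}
for name, (m, sharpe) in report.items():
    print(f"{name:>8}: mean={m:.4f} sharpe={sharpe:.2f}")
```

Comparing the first and last slices against the full history is a cheap way to spot a model whose performance drifts as the walk-forward window moves through different regimes.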
