Adding this to the forum as I think it’s an important topic. Here I’ll focus mainly on the potential ‘data leakage’ issue, rather than on the more general problem of over-fitting, which could be due to many different reasons.
In short, eras are one month long, chronological, and non-overlapping in time. However, if some features are based on company financials, which are usually reported quarterly, each era could potentially include information from the previous 1-2 eras. To me this seems like a concern: good validation and test performance could be mainly due to data leakage, and those optimized models wouldn’t necessarily perform as well on live data.
Marcos Lopez de Prado (MLdP) proposes in chapter 7 of his book Advances in Financial Machine Learning to ‘purge’ some time series after each train and test set to overcome this problem. He also suggests ‘embargoing’ some additional time series after the test set if purging alone is not sufficient.
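To make the idea concrete, here’s a minimal sketch of an era-level purged split, assuming eras are consecutive integers and the feature lookback spans at most two eras. The column name `era`, the function name, and the `purge`/`embargo` sizes are all illustrative, not anything Numerai or MLdP prescribes:

```python
import pandas as pd

def purged_split(df, era_col, train_end, purge=2, embargo=0):
    """Split df into train/test at era `train_end`, dropping `purge`
    eras after the train set (plus `embargo` extra eras) so that
    lagged financials in early test eras can't overlap the train set."""
    eras = sorted(df[era_col].unique())
    train_eras = [e for e in eras if e <= train_end]
    # the purged (and embargoed) eras are dropped entirely
    first_test_era = train_end + purge + embargo + 1
    test_eras = [e for e in eras if e >= first_test_era]
    return df[df[era_col].isin(train_eras)], df[df[era_col].isin(test_eras)]

# toy example: 10 eras, one row each
toy = pd.DataFrame({"era": list(range(1, 11)), "x": range(10)})
train, test = purged_split(toy, "era", train_end=5, purge=2)
# train covers eras 1-5, eras 6-7 are purged, test starts at era 8
```

With `purge=2` this discards two months of data per split boundary, which is the trade-off the questions below are really about: lost training data vs. a cleaner out-of-sample estimate.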
@richai mentioned in today’s OH (office hours) with @arbitrage that models (and features?) with high auto-correlation, as well as some shorter-term features (shorter window lengths?), could potentially be impacted by data leakage between eras. ← Please correct me if I’m misrepresenting anything here.
My main questions are:
- Is data leakage between eras a concern with the Numerai dataset?
- If it is, how many eras should one purge? Purging 2 eras after each train and val/test set seems like it should be sufficient.
- Other than comparing live performance, is there a way to check whether a model optimized on a purged dataset performs better than one optimized on a non-purged dataset?
- If Numerai provided an additional grouping, e.g. 3 eras = 1 epoch, that is less likely to have data leakage between epochs, would that help?
- Perhaps I’m overthinking this way too much and it’s not really a concern at all?