Maybe we can share some experiences on how to do good validation, that is: how to get your validation holdout eras to be a reliable indicator of how well your model will perform on live data.
It has been a mixed bag for me. One of my long-running stable models (18 weeks) offered a good opportunity to check whether the 51 eras I held out for validation show the same distribution of scores as my live scores. Those 51 eras weren't selected at random, but spread out along an axis based on a second model, one that is trained on a small subset of eras and performs very well during times of no burn and very poorly during burns (called goodtimes). So the 51 eras I used as validation cover the range of eras that numer.ai provides more evenly than if you had just selected 51 eras at random.
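For anyone who wants to try this kind of spread-out holdout selection, here is a minimal sketch. Everything in it is hypothetical (the era IDs and the per-era goodtimes scores are random stand-ins for real per-era correlations); the idea is just to sort eras by the second model's score and take evenly spaced ones along that axis.

```python
import numpy as np

# Hypothetical per-era scores of a second model ("goodtimes"); in practice
# these would be its real per-era correlations, one value per era.
rng = np.random.default_rng(0)
era_ids = np.arange(120)
goodtimes_score = rng.normal(0.02, 0.03, size=era_ids.size)

def stratified_holdout(era_ids, scores, n_holdout):
    """Pick holdout eras spread evenly across the score axis:
    sort eras by the second model's score, then take evenly spaced ones."""
    order = np.argsort(scores)
    picks = np.linspace(0, era_ids.size - 1, n_holdout).round().astype(int)
    return era_ids[order[picks]]

# 51 holdout eras covering the goodtimes axis from worst to best.
holdout = stratified_holdout(era_ids, goodtimes_score, 51)
```

A plain random sample would cluster around typical eras; this spread guarantees the holdout also contains the extremes of the second model's range.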
Basically, what you expect if validation eras and live eras are similar [and as far as I can tell, that is still true - live is a subset on the y-axis scale here that overlaps with the validation eras] is that the two point clouds are hard to distinguish.
Sadly, for model Thirteen, that wasn't the case. About half of the live eras performed considerably worse than anything I had seen in the validation dataset (top half). Weirdly enough, I have another model, badtimes, for which validation and live are spot-on (bottom half). (The figure uses the performance of the goodtimes model in every era to spread out the points on the y-axis. If you only have the validation and live scores of your model to compare, you can make two density plots and see if they overlap, but I find it convenient to have this second axis to visualize the data.)
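If you want a rough number to go with the density-plot comparison, the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) is a simple option. A sketch with made-up scores, implemented in plain numpy; substitute your own per-era correlations:

```python
import numpy as np

# Hypothetical per-era correlations; substitute your model's real scores.
rng = np.random.default_rng(1)
val_scores = rng.normal(0.02, 0.025, size=51)   # validation eras
live_scores = rng.normal(0.00, 0.030, size=18)  # live eras so far

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs.
    Near 0 means the two distributions overlap well; near 1 means they don't."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

d = ks_statistic(val_scores, live_scores)
```

With only a handful of live eras the statistic is noisy, so treat it as a sanity check alongside the plots rather than a verdict.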
It is how well live and validation match for my badtimes model that gives me confidence that the data numer.ai is providing is not too old to be useful. The live scores that badtimes gets (it is trained on 18 eras, a different 18 than goodtimes is trained on) match the validation scores I have for badtimes as well, and the region to the bottom right, where the live scores extend into territory not covered by the validation eras, is where the training eras of badtimes sit.
Lessons learned. Thirteen was an attempt at a model that does well during both bad (burn) times and good times. It was a bit more complicated than shown here: after I had a model that I liked, whose validation didn't look too bad, I ran the model two more times with different sets of holdouts picked at random, and then averaged all three models together. The other two runs performed worse than the first in terms of validation performance, but averaged together things looked even better for validation (almost all eras positive). In retrospect, I should have discarded the first model (the one where validation didn't look too bad): by iterating on models until I had one whose validation looked good, I had almost certainly overfitted.
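The train-three-times-and-average procedure looks roughly like this sketch. The split sizes, era counts, and dummy predictions are all hypothetical; in practice you would train a real model on each split and average its predictions:

```python
import numpy as np

rng = np.random.default_rng(2)
eras = np.arange(120)

def random_holdout_split(eras, n_holdout, rng):
    """Split eras into a random holdout set and the remaining training eras."""
    holdout = rng.choice(eras, size=n_holdout, replace=False)
    train = np.setdiff1d(eras, holdout)
    return train, holdout

# Three runs, each with its own random 51-era holdout.
splits = [random_holdout_split(eras, 51, rng) for _ in range(3)]

# After training one model per split, the ensemble is the per-row mean
# of the three runs' predictions (dummy predictions stand in here).
preds = [rng.uniform(size=10) for _ in splits]
ensemble = np.mean(preds, axis=0)
```

The trap described above lives outside this code: if the first split was itself selected because its validation looked good, the ensemble's validation numbers inherit that optimism.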
I have no clue why my extreme models (both goodtimes and badtimes) have such a good correspondence between live and validation, but one lesson learned is that spending time on understanding the relationship between your validation scores and your live scores is time well spent. So keep one or two models around for a long time to check whether your validation scores match up to the live scores!