The Era Splitting paper and code are now open source and available online.
Paper: https://arxiv.org/abs/2309.14496
Implementation: https://github.com/jefferythewind/scikit-learn-erasplit
Notebooks: https://github.com/jefferythewind/era-splitting-notebook-examples (example notebooks to replicate the experiments from the Era Splitting paper)
Quant Club: https://www.youtube.com/watch?v=HOmHxuRQy18
This project was a deep dive into the inner workings of one of our favorite ML algorithms, gradient boosted decision trees, though the ideas apply to decision-tree-based models in general. The new algorithm, which you can install via the second link above, incorporates era-wise information into two new splitting criteria: era splitting and directional era splitting. Details are outlined in the paper and in the Quant Club presentation on YouTube.
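For anyone who wants to try it, here is a rough sketch of the intended workflow: prepare X, y, and an integer era id per row, install the fork, and fit a histogram gradient boosting model (as I understand it, the fork builds on scikit-learn's histogram gradient boosting estimators). The exact way the era vector is handed to the estimator is documented in the repo README; the fit call below is only a placeholder for that step, and everything else is standard scikit-learn.

```python
# Rough workflow sketch. Install the fork per the scikit-learn-erasplit README,
# then import from it as you would from scikit-learn.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor  # the estimator family the fork modifies

rng = np.random.default_rng(0)
n_eras, rows_per_era, n_features = 20, 200, 10

X = rng.normal(size=(n_eras * rows_per_era, n_features))
y = 0.1 * X[:, 0] + rng.normal(scale=0.5, size=len(X))
eras = np.repeat(np.arange(n_eras), rows_per_era)  # one integer era id per row

model = HistGradientBoostingRegressor(max_iter=100, max_depth=4)
# Placeholder: in the era-split fork the `eras` vector is also supplied so the
# new criteria can be computed; see the repo README for the exact argument name.
model.fit(X, y)
print(model.score(X, y))
```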
Since the Quant Club presentation, I have developed the directional era splitting criterion further and applied the algorithm to another synthetic data set designed to test this kind of algorithm: the synthetic memorization data set from the paper *Learning explanations that are hard to vary* by Parascandolo et al. This data set is designed to confound and befuddle naive ML models. A complex invariant swirl signal is embedded in the data along with simple spurious signals that shift from one era to another. In the test set the swirl remains, but the spurious signals disappear completely. Naive models learn the spurious signals, and their test performance is poor because they never learned the invariant signal. Indeed, this is what happens with common gradient boosted decision tree models. With directional era splitting, however, we perform almost perfectly on the test set, meaning the model ignored the spurious signals and learned the invariant swirl signal. Era splitting also performs better than the naive model. Here is an excerpt figure from the paper.
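To give a flavor of the setup, here is a simplified sketch of the construction, not the exact data set from Parascandolo et al.: every era shares one invariant nonlinear rule (a stand-in for the swirl), while a spurious feature encodes the label through an era-specific offset that a tree can easily memorize, and at test time that feature carries no information.

```python
# Simplified illustration of the idea behind the synthetic memorization data set.
import numpy as np

rng = np.random.default_rng(42)
N_ERAS, N_PER_ERA = 8, 500
offsets = rng.uniform(-10, 10, size=N_ERAS)  # one shortcut offset per training era

def make_era(n, era=None):
    x_inv = rng.normal(size=(n, 2))                           # invariant features
    y = (np.sin(3 * x_inv[:, 0]) > x_inv[:, 1]).astype(int)   # invariant rule (stand-in for the swirl)
    if era is not None:                                       # training era: shortcut present
        x_sp = offsets[era] + y + 0.05 * rng.normal(size=n)   # label leaks through an era-specific offset
    else:                                                     # test set: shortcut disappears
        x_sp = rng.normal(size=n)
    X = np.column_stack([x_inv, x_sp])
    return X, y, np.full(n, -1 if era is None else era)

parts = [make_era(N_PER_ERA, era=e) for e in range(N_ERAS)]
X_tr = np.vstack([p[0] for p in parts])
y_tr = np.concatenate([p[1] for p in parts])
era_tr = np.concatenate([p[2] for p in parts])

X_te, y_te, _ = make_era(4000)  # no spurious signal at test time
```

A pooled model can carve up the spurious feature's per-era intervals and score well in-sample while ignoring the invariant rule, which is the failure mode the era-wise criteria are meant to prevent.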
This good result comes on top of meaningful results on another synthetic data set (the shifted sine wave), as well as the results on the Numerai data set discussed in the Quant Club video. However, our real goal with this project was to improve considerably on the baseline LGBM model commonly used in the example notebooks, to really blow it out of the water. That goal proved hard to achieve. I still believe the ideas are worthwhile, and I hope members of the community can help push this research forward.
Time Complexity
One major issue is that these new splitting criteria increase the time complexity by a factor on the order of the number of eras in the training data: we have to perform that many more operations per split to compute the desired criteria. Reducing this time complexity would make the algorithm much nicer to work with on big data sets like the Rain data with all the features and eras. For the Numerai result reported in the paper, training took a couple of days. The toy examples in the notebooks, however, are fast and easy to get started with.
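To make the cost concrete, here is an illustrative NumPy sketch (not the fork's histogram-based code) of where the extra factor comes from: a plain criterion scores a candidate split once, while an era-wise criterion scores it once per era and aggregates, so each candidate split costs roughly n_eras times more.

```python
# Illustrative only: score one candidate split era by era and average the gains.
import numpy as np

def era_wise_split_gain(grad, hess, era_ids, split_mask, reg_lambda=1.0):
    """Average the standard second-order split gain over eras."""
    def leaf_score(g, h):
        return g.sum() ** 2 / (h.sum() + reg_lambda)

    gains = []
    for era in np.unique(era_ids):                  # extra loop: one pass per era
        in_era = era_ids == era
        left, right = in_era & split_mask, in_era & ~split_mask
        if not left.any() or not right.any():
            continue                                # era has no samples on one side
        gain = (leaf_score(grad[left], hess[left])
                + leaf_score(grad[right], hess[right])
                - leaf_score(grad[in_era], hess[in_era]))
        gains.append(gain)
    return float(np.mean(gains)) if gains else 0.0

# Tiny demo with random gradients and 60 eras.
rng = np.random.default_rng(1)
g, h = rng.normal(size=1000), np.ones(1000)
eras = rng.integers(0, 60, size=1000)
mask = rng.random(1000) < 0.5
print(era_wise_split_gain(g, h, eras, mask))
```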
Signals
Why not try it out on your Signals model? Era splitting helped my Signals model more than it helped on the classic data. It also took less time to train, since I use fewer features and a lower number of iterations.
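If you want to try this on Signals-style data, the only extra ingredient is an era id per row, for example one era per date. The file and column names below are placeholders, not the actual Signals schema.

```python
# Hypothetical Signals-style setup: derive one integer era id per date.
import pandas as pd

df = pd.read_parquet("my_signals_train.parquet")            # assumed local file
feature_cols = [c for c in df.columns if c.startswith("feature")]

df["era_id"] = pd.factorize(df["date"])[0]                  # one era per date
X = df[feature_cols].to_numpy()
y = df["target"].to_numpy()
eras = df["era_id"].to_numpy()                              # fed to the era-wise criteria
```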
Era Groups
Can we group eras together in a smart way to reduce the number of eras (and therefore the number of computations) while improving out-of-sample performance? This is an idea I played around with, and it seems to improve performance in certain cases. It is also a simple way to speed up training.
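As a concrete example of the simplest version of this idea, here is a sketch that maps consecutive eras into coarser groups before training; the group size is an arbitrary illustrative choice, and smarter groupings (for example clustering eras by similarity) would be the more interesting direction.

```python
# Map consecutive eras into coarser "super-eras", dividing the per-split era
# loop by the group size.
import numpy as np

def group_eras(era_ids, group_size=4):
    """Map each original integer era id to a coarser group id.
    Assumes era ids are integers ordered in time."""
    return np.asarray(era_ids) // group_size

eras = np.arange(120).repeat(50)          # toy example: 120 eras, 50 rows each
grouped = group_eras(eras, group_size=4)
print(len(np.unique(eras)), "->", len(np.unique(grouped)))  # 120 -> 30
```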
Feedback
Please let me know about any bugs or feedback, and I'll do my best to support the project.
Happy Coding