I was going to write an internal email to the team at Numerai but figured I’d share this here.
Numerai uploads our own internal LightGBM models trained on various targets. We call these Benchmark Models so that data scientists on Numerai can compare their own models against them and evaluate their performance.
Each of the targets on Numerai handles risk in a different way, so models trained on them can perform quite differently out of sample. We don’t give all the details on this, but the Jerome target handles risk in a more vanilla way than Ralph: Ralph is supposed to be lower risk than Jerome, and Cyrus is supposed to be lower risk than Ralph.
When we create targets we aren’t engineering them to be good in sample or anything like that; we are merely trying to reduce risk in the targets in a technical sense. For example, raw return would be a high-risk target because it contains factor risks, and a lower-risk target would take those factor risks out.
Here are the three Benchmark Models, one trained on each of these targets. Note that the first two are trained on v4 data while Cyrus is trained on v4.1 data; we don’t currently have a Benchmark Model for Cyrus trained on v4 because the Cyrus target was developed more recently.
https://numer.ai/lg_lgbm_v4_jerome20
https://numer.ai/lg_lgbm_v4_ralph20
https://numer.ai/lg_lgbm_v41_cyrus20
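For reference, here is a minimal sketch of training a model in this style. The data file, hyperparameters, and target column name below are illustrative assumptions, not the exact settings behind the Benchmark Models.

```python
import lightgbm as lgb
import pandas as pd

# Assumed file and column names; the real v4 files and target names may differ slightly.
train = pd.read_parquet("v4/train_int8.parquet")
feature_cols = [c for c in train.columns if c.startswith("feature_")]
target_col = "target_jerome_v4_20"  # or the Ralph / Cyrus 20-day target

model = lgb.LGBMRegressor(
    n_estimators=2000,       # illustrative hyperparameters only
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2 ** 5,
    colsample_bytree=0.1,
)
model.fit(train[feature_cols], train[target_col])
```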
It’s surprising to me that in this challenging period for models on Numerai, the safer Cyrus model has by far the worst TC, ranked 9877th on the leaderboard, whereas the Jerome model is ranked 499th in terms of TC. Moreover, even on CORR20V2, which is measured against Cyrus, the Cyrus model does the worst, Ralph does the second worst, and Jerome does the best with a CORR20V2 of 0.0108 (ranking 159th on the leaderboard). This is the opposite of the order one would expect, which presents an interesting anomaly for data scientists to research.
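For anyone who wants to reproduce this kind of shape on their own validation predictions, a rough per-era correlation against the Cyrus target is a useful diagnostic. The sketch below uses a plain Spearman correlation and an assumed target column name, not the exact CORR20V2 formula.

```python
import pandas as pd

def per_era_corr(df, pred_col="prediction", target_col="target_cyrus_v4_20"):
    """Rank correlation of predictions with the (assumed) Cyrus target column, era by era."""
    return df.groupby("era").apply(
        lambda era: era[pred_col].corr(era[target_col], method="spearman")
    )

# The cumulative sum shows the drawdown shape discussed above:
# per_era_corr(validation).cumsum().plot()
```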
Of course, this is just one year of data, and that’s not nearly enough time to draw conclusions. Nevertheless, it’s interesting to ask:
- What techniques would improve drawdowns over the past year?
- How different are the feature exposures in models trained on Cyrus vs models trained on Jerome?
- Do the features which Cyrus models cling to stop working in the recent period?
- Does feature neutralization (An introduction to feature neutralization / exposure) or era boosting (Era Boosted Models) help in the recent drawdown? (A minimal sketch follows this list.)
- Are there features which are particularly responsible for 2023’s bad performance?
- Does anyone have a model which does not draw down or have the same shape as these Benchmark Models in terms of cumulative Cyrus correlation that they’d like to talk about?
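For the exposure and neutralization questions above, here is a minimal per-era sketch. It uses a standard least-squares projection; the era and feature column names are assumptions about the dataset layout.

```python
import numpy as np
import pandas as pd

def feature_exposures(df, pred_col, feature_cols):
    """Correlation of predictions with each feature, a simple exposure measure."""
    return pd.Series({f: df[pred_col].corr(df[f]) for f in feature_cols})

def neutralize(df, pred_col, feature_cols, proportion=1.0):
    """Remove a proportion of the predictions' linear exposure to the features, era by era."""
    pieces = []
    for _, era in df.groupby("era"):
        preds = era[pred_col].to_numpy(dtype=np.float64)
        exposures = era[feature_cols].to_numpy(dtype=np.float64)
        # Project the predictions onto the feature space and subtract that component.
        projection = exposures @ (np.linalg.pinv(exposures) @ preds)
        pieces.append(pd.Series(preds - proportion * projection, index=era.index))
    neutral = pd.concat(pieces)
    return neutral / neutral.std()  # rescale so scores stay comparable
```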
Of course, one of the reasons Numerai releases so many targets is that we want the Numerai community to be able to experiment with many different targets even though we score on the Cyrus target. As you can see, over the past year having some exposure to a model trained on Jerome may have attenuated drawdowns and burns when combined with models trained on other targets.
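One simple way to get that kind of exposure is to blend predictions from models trained on different targets. The sketch below is just a weighted rank average; the weights and variable names are illustrative, not a recommendation.

```python
import pandas as pd

def blend(predictions, weights=None):
    """Percentile-rank each model's predictions, then take a weighted average.

    predictions: dict of {name: pd.Series} aligned on the same ids.
    weights: dict of {name: float}; defaults to equal weights.
    """
    weights = weights or {name: 1.0 for name in predictions}
    ranked = {name: preds.rank(pct=True) for name, preds in predictions.items()}
    total = sum(weights.values())
    return sum(weights[name] * ranked[name] for name in ranked) / total

# e.g. keep some Jerome exposure alongside a Cyrus-trained model (weights are illustrative):
# submission = blend({"cyrus": cyrus_preds, "jerome": jerome_preds},
#                    {"cyrus": 0.7, "jerome": 0.3})
```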
We’re busy researching even more targets. We hope to release them soon.