Most of us are striving to create models that produce the best scores on the diagnostics, or at least that’s what we tell everyone else. But let’s be honest: we are actually trying to accumulate NMR as fast as we can, and the premise is that by training our models to optimize the diagnostics we will earn more and burn less. So, as any good data scientist would, I searched for models that maximized the diagnostic values, but when I looked at their performance, the models that earned the most were not the ones with the best diagnostics! So, what’s up? Maybe these metrics are good for Numerai, but not so great for performance. Unfortunately, like everything else we do for Numerai, my ground truth is limited: I only started feature neutralization in round 241, so I didn’t have a lot of performance data to use, but I think the process I used is applicable to everyone.
What I set out to do was determine which diagnostics are correlated with performance and then select models that maximize those diagnostics. I also have another metric I use called validation_score, which is related to era consistency across the validation data (val1 and val2). So, in all, I used 10 metrics in my evaluation: validation_sharpe, validation_mean, feature_neutral_mean, validation_sd, feature_exposure, max_drawdown, corr_plus_mmc_sharpe, mmc_mean, corr_with_example_preds, and validation_score.
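In code, I keep that order in a list so the d0…d9 shorthand used below stays unambiguous (names only; validation_score is my own metric, not an official diagnostic):

```python
# Diagnostic order used throughout; d0..d9 refer to positions in this list
DIAGNOSTICS = [
    "validation_sharpe",        # d0
    "validation_mean",          # d1
    "feature_neutral_mean",     # d2
    "validation_sd",            # d3
    "feature_exposure",         # d4
    "max_drawdown",             # d5
    "corr_plus_mmc_sharpe",     # d6
    "mmc_mean",                 # d7
    "corr_with_example_preds",  # d8
    "validation_score",         # d9 (my own era-consistency metric)
]
```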
The first thing I did was calculate the average performance of all my staked models since I started feature neutralization in round 241:
| rank | account | average CORR | average MMC |
| --- | --- | --- | --- |
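In code, this is just a groupby over the per-round results (a minimal sketch in pandas; `round_scores.csv` and its column names are my own stand-ins for however you export your round results):

```python
import pandas as pd

# One row per (model, round) with live CORR and MMC,
# e.g. exported from the API or a spreadsheet (assumed here)
perf = pd.read_csv("round_scores.csv")  # columns: model, round, corr, mmc
perf = perf[perf["round"] >= 241]       # feature neutralization started in 241

# Average live performance per model, best CORR first
avg_perf = (
    perf.groupby("model")[["corr", "mmc"]]
        .mean()
        .sort_values("corr", ascending=False)
)
avg_perf["perf_rank"] = range(1, len(avg_perf) + 1)  # 1 = best
print(avg_perf)
```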
I was originally thinking I would have to do three analyses, for CORR, MMC, and CORR+MMC, but interestingly enough, the rank orders of CORR and MMC are the same (that may not be the case for your models, so you may have to repeat the process). Then I calculated the diagnostics for all of the staked models (note that d0…d9 correspond to the diagnostic order above).
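For reference, here is roughly how a few of these can be recomputed from validation predictions (a sketch, not the official diagnostics code; `val_df` with `era`, `target`, `example_preds`, and one prediction column per model is an assumption):

```python
import numpy as np
import pandas as pd

def numerai_corr(preds, target):
    # rank the predictions, then take the plain correlation with the target
    ranked = preds.rank(pct=True, method="first")
    return np.corrcoef(ranked, target)[0, 1]

def diagnostics(val_df, pred_col, target_col="target"):
    # per-era validation correlations
    era_corrs = val_df.groupby("era").apply(
        lambda d: numerai_corr(d[pred_col], d[target_col])
    )
    # drawdown of the cumulative (1 + corr) curve across eras
    cumulative = (era_corrs + 1).cumprod()
    max_drawdown = (cumulative / cumulative.cummax() - 1).min()
    return {
        "validation_mean": era_corrs.mean(),                      # d1
        "validation_sd": era_corrs.std(),                         # d3
        "validation_sharpe": era_corrs.mean() / era_corrs.std(),  # d0
        "max_drawdown": max_drawdown,                             # d5
        "corr_with_example_preds": numerai_corr(                  # d8
            val_df[pred_col], val_df["example_preds"]
        ),
        # feature_neutral_mean, feature_exposure, the MMC metrics, and my
        # validation_score need the feature matrix / example predictions
        # per era -- omitted here for brevity
    }

# one row per staked model, one column per diagnostic
diag_df = pd.DataFrame(
    {m: diagnostics(val_df, m) for m in avg_perf.index}
).T
```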
Then I found the rank order of each model for each diagnostic.
Note that d0, d1, d2, d5, d6, d7, and d9 are ranked in descending order (higher values are better) while d3, d4, and d8 are ranked in ascending order (lower values are better). Next, I found the correlation of each diagnostic rank with the performance rank:
| validation sharpe | validation mean | feature neutral mean | validation sd | feature exposure | max drawdown | corr plus mmc sharpe | mmc mean | corr with example preds | validation score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
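Putting the last two steps together, the ranking with the ascending/descending handling and the rank correlation look something like this (a sketch reusing `diag_df` and `avg_perf` from above):

```python
# True = higher is better (descending rank), False = lower is better
HIGHER_IS_BETTER = {
    "validation_sharpe": True,         # d0
    "validation_mean": True,           # d1
    "feature_neutral_mean": True,      # d2
    "validation_sd": False,            # d3
    "feature_exposure": False,         # d4
    "max_drawdown": True,              # d5
    "corr_plus_mmc_sharpe": True,      # d6
    "mmc_mean": True,                  # d7
    "corr_with_example_preds": False,  # d8
    "validation_score": True,          # d9
}

# rank each model on each diagnostic (rank 1 = best)
diag_ranks = pd.DataFrame({
    d: diag_df[d].rank(ascending=not HIGHER_IS_BETTER[d])
    for d in diag_df.columns
})

# correlation of each diagnostic's rank with the live performance rank
# (Pearson on ranks, i.e. a Spearman-style correlation)
rank_corrs = diag_ranks.corrwith(avg_perf["perf_rank"])
print(rank_corrs.sort_values())
```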
Now it becomes clear which diagnostics are most important when selecting a model to upload or determining stake amounts. You can draw some interesting conclusions from this. I was most surprised by the inverse correlation with max drawdown: models with high drawdowns on the validation data perform better on live data! Also, consistency across eras (my validation_score) is the most important metric (.457).
I had 51 trained models, so I created a normalized weight vector from the correlations above and calculated a predicted score for each model (sorted by score, lower is better).
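Concretely, the weights are the rank correlations normalized to sum to one in absolute value, and the predicted score is the weighted sum of each model's diagnostic ranks (a sketch; `all_diag_df` holding the diagnostics of all 51 trained models is an assumption):

```python
# normalize the rank correlations into a weight vector
weights = rank_corrs / rank_corrs.abs().sum()

# rank all trained models on each diagnostic, same direction rules as before
all_ranks = pd.DataFrame({
    d: all_diag_df[d].rank(ascending=not HIGHER_IS_BETTER[d])
    for d in all_diag_df.columns
})

# weighted sum of ranks; since rank 1 = best, a lower score is better
predicted_score = all_ranks.mul(weights, axis=1).sum(axis=1).sort_values()
print(predicted_score)
```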
I then selected the top 15 models to be uploaded and staked more on the models with a lower (better) predicted score.
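For the stake split itself, I simply put more on the better scores; one possible way to turn that into fractions (an illustration, not the exact split I used):

```python
# keep the 15 best (lowest) predicted scores
top15 = predicted_score.head(15)

# stake proportional to reversed rank, so the best model gets the most
rev_rank = top15.rank(ascending=False)  # best score -> largest value
stake_fraction = rev_rank / rev_rank.sum()
print(stake_fraction)
```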
If you are submitting multiple models and spreading your stake amongst them like me, this may be a good way to select models and stake them according to predicted performance and not just the diagnostics. Let me know if you have any ideas for improvement or if I made any errors. May your earns be strictly greater than your burns.