Diagnostics for #39

Diagnostics are built to help us, but they can be very misleading.

My most successful model, which stands at #39 at the time of writing (nyuton_test8), has such bad diagnostics that I almost threw it out at the beginning, and I didn't start staking on it until recently.

Trust your CV!
I’m writing this post partly for myself as a reminder, when I see similar results with the new dataset.
To be frank, this model is an ensemble of models trained on CV folds, and the training data also includes the validation set. The diagnostics shown here are for the only component model that doesn't touch validation data at all; it's probably a good approximation of the other fold models as well.
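For readers unfamiliar with the setup, an ensemble of CV-fold models usually means keeping every per-fold model and averaging their predictions. A minimal sketch (the model class and data here are stand-ins, not the poster's actual setup):

```python
# Sketch: train one model per CV fold, then average their predictions.
# Ridge and the random data are placeholders for the real model/dataset.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
X_live = rng.normal(size=(50, 10))  # stand-in for tournament data

fold_models = []
for train_idx, _ in KFold(n_splits=5).split(X):
    fold_models.append(Ridge().fit(X[train_idx], y[train_idx]))

# The ensemble prediction is the mean over the fold models.
pred = np.mean([m.predict(X_live) for m in fold_models], axis=0)
```

Each fold model only ever sees 4/5 of the data, which is why one of them can serve as a validation-free proxy for diagnostics.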

Another model (nyuton_test15) looks even worse than that. And it got 14 medals in its first 9 completed rounds…


Definitely a great reminder; trust your CV indeed. Although I'd still keep an eye on risk, because medals and a high rank may also be the result of a high-variance model, which tends to do worse when the regime changes.


Great reminder! But what is #39?

The other way around is also definitely the case. I had a NN with 'reasonable' diagnostics (everything green) which also performed well for 5-6 rounds; after that it got burned 4 rounds, only to recover now again. And no, I didn't use val for training :slight_smile: (but I suspect there is leakage from determining the number of epochs using val).
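That kind of leakage can be avoided by carving the early-stopping split out of the training data itself, so the official validation set never influences the epoch count. A hedged sketch (SGDRegressor and the random data stand in for the actual NN and dataset):

```python
# Sketch: choose the epoch count with an inner split taken from train
# only, so the official validation set never leaks into the decision.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 10)), rng.normal(size=300)

# Inner early-stopping split from the training data only.
X_tr, X_es, y_tr, y_es = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

model = SGDRegressor(random_state=0)
best_err, best_epoch, patience = np.inf, 0, 5
for epoch in range(1, 101):
    model.partial_fit(X_tr, y_tr)          # one pass = one "epoch"
    err = np.mean((model.predict(X_es) - y_es) ** 2)
    if err < best_err:
        best_err, best_epoch = err, epoch
    elif epoch - best_epoch >= patience:   # stop after no improvement
        break
```

The official validation data is then only used once, for the final diagnostics, rather than repeatedly during tuning.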

It ranks #39 on the leaderboard.

So your CV results looked very good but it did poorly on the full diagnostics?

Yes, CV was great! Validation score was not so great!

My most profitable model, running for 4 months with a top-100 3M return, also has a rubbish diagnostics score. I was able to pick it out for staking because it seems to have by far the most stable daily score. Additionally, after I settled on my validation scheme, it came out 4th among my models on non-corr metrics, i.e. ratio-based metrics.

So yes, people should try to create validation methods that are at least not entirely dependent on the provided validation data.

Out of curiosity, what kind of mean correlation numbers do you see in your CV runs?


0.043 when trained on train+validation

Thank you! Very helpful. I’ve been leaning on validation metrics for so long I didn’t know what to expect from a CV score!

One other question: when you find a good set of hyperparameters using CV on some number of folds of the combined train+validation data, do you then train your final model (the one you use for tournament prediction) with those HPs on all the available data (instead of folds)? I'm assuming not, because then you wouldn't have a sane measure of when to stop training. But if not, do you just stick with training on, say, 4 of 5 folds and validating on the 5th?

Thanks again.
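For context, one common pattern behind this question (not necessarily what the poster does) is to pick hyperparameters by CV over all the data, then refit once with those HPs on everything; the stopping-criterion problem only bites when an HP, like the number of epochs, depends on held-out data. A minimal sketch with placeholder model and data:

```python
# Sketch: select HPs by CV over train+validation, then refit the final
# model on all available data with the chosen HPs. Ridge and the random
# data are placeholders for the real model/dataset.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=KFold(n_splits=5))
search.fit(X, y)

# Refit with the winning HPs on all the data (GridSearchCV's default
# refit=True already does this; shown explicitly for clarity).
final_model = Ridge(**search.best_params_).fit(X, y)
```

The alternative the thread already mentions is to skip the refit and keep the per-fold models as an ensemble, which sidesteps the "when to stop" issue entirely.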

Oh man, this is a dead simple random forest!
That's the funny part of it. Apparently you can get this close to the top with an RF.
NNs might give you a boost in MMC, but in raw accuracy (CORR) an RF is hard to beat.
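"Dead simple" here presumably means something close to sklearn's defaults; the poster doesn't state the actual hyperparameters, so this is purely illustrative:

```python
# Sketch of a "dead simple" random forest baseline; the hyperparameters
# and random data are illustrative, not the poster's actual setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=100, max_depth=5,
                           random_state=0, n_jobs=-1)
rf.fit(X, y)
preds = rf.predict(X)
```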

Yea, I was also wondering why the diagnostics are pretty much useless for comparing models. They're a good sanity check, but not much more. Could it be that some people have leakage into the validation set, and thus skew the diagnostics ranks toward the top?
At the beginning, I did include the validation set in CV, and even though the model never trained on it, this still led to massive overfitting: all my metrics were light green (99+), but the live results were absolutely terrible (Numerai before round 274).

Everybody had bad rounds before R274: https://dashboard.numeraipayouts.com/ The median score was pretty much negative. You need to judge the relative performance of your model on live data, not its performance in isolation.


I agree, and this is what I'm doing currently. However, over 4 rounds my model was much, much worse than other models. Maybe I should have gathered more data, but being below the 20th percentile in 2 of 4 rounds is enough for me to say that it is "not optimal, probably".

That doesn't say much at all; you need to test longer. The above-mentioned model also has more than 2 rounds below the 20th percentile…


This is for the old data format (310 features) or the new one?

BTW: I can corroborate your initial statement on the diagnostics. They don’t always indicate whether a model is good or not.

The diagnostics for this model are appalling, but… it's a decent model. It peaked at #4 for MMC and around #750 for correlation. A consistent performer, more green than red, with 19 medals in all.


Those appalling diagnostics have garnered two more silver medals. I don’t get it.