Diagnostics are built to help us, but they can be very misleading.
My most successful model (nyuton_test8), which stands at #39 at the time of writing, had such bad diagnostics that I almost threw it out at the beginning, and I didn't start staking on it until recently.
Trust your CV!
I'm writing this post partly as a reminder to myself for when I see similar results with the new dataset.
To be frank, this model is an ensemble of models trained on CV folds, and the training data also includes the validation set. What I'm showing is the diagnostic of the only fold model that doesn't include validation data at all, but it's probably a good approximation of the other fold models as well.
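A minimal sketch of the fold-ensemble idea described above, with made-up stand-in data (the actual features, targets, and model settings are assumptions, not the author's):

```python
# Hypothetical sketch: train one model per CV fold, then average the
# predictions of all fold models at prediction time.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # stand-in features
y = 0.1 * X[:, 0] + 0.5 * rng.normal(size=200)   # stand-in target
X_live = rng.normal(size=(20, 10))               # stand-in live data

fold_models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_models.append(model)

# The ensemble prediction is the mean over the five fold models.
pred = np.mean([m.predict(X_live) for m in fold_models], axis=0)
```

One side effect of this setup is exactly what the post describes: only a model trained without the held-out validation rows gives an honest diagnostic.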
Definitely a great reminder; trust your CV indeed. Although I'd still keep an eye on risk, because medals plus a high rank may also be the result of a high-variance model, and those tend to do worse when the regime changes.
The other way around is definitely also the case. I had an NN with 'reasonable' diagnostics (everything green) which performed well for 5-6 rounds, then got burned for 4 rounds, only to recover again now. And no, I didn't use val for training (but I suspect there is leakage, because the number of epochs was determined using val).
My most profitable model, running for 4 months with a top-100 3M return, also has a rubbish diagnostics score. I was able to pick it out for staking because it seems to have by far the most stable daily score. Additionally, after I settled on my validation scheme, it came out 4th among my models on non-correlation metrics, i.e. ratio-based metrics.
So yes, people should try to create validation methods that are at least not entirely dependent on the provided validation data.
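One way to build such a validation method is to carve a time-ordered holdout out of the training eras instead of leaning on the provided validation set. A hypothetical sketch with stand-in data (the column names and split ratio here are made up):

```python
# Hypothetical sketch: hold out the last 25% of training eras as a
# homemade validation set. Splitting by whole eras avoids mixing rows
# from the same era across the train/validation boundary.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "era": np.repeat(np.arange(20), 50),  # 20 stand-in eras, 50 rows each
    "feature": rng.normal(size=1000),
})
df["target"] = 0.1 * df["feature"] + rng.normal(size=1000)

eras = np.sort(df["era"].unique())
cut = eras[int(len(eras) * 0.75)]         # first era of the holdout
train_df = df[df["era"] < cut]
val_df = df[df["era"] >= cut]
```

Because the holdout is built from training eras you control, it stays useful even if the provided validation set has been (indirectly) overfit.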
One other question: When you find a good set of hyperparameters using CV on some number of folds of the combined train+validation data, do you then train your final model (the one you use for tournament prediction) by using those HPs and all the available data (instead of folds)? I’m assuming not because then you wouldn’t have a sane measure of when to stop the training. But if not, then do you just stick with training on, say, 4 of 5 folds and validating on the 5th?
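The early-stopping setup the question describes can be sketched like this: fit on 4 of 5 folds and use the held-out 5th fold only to decide when to stop, never for fitting. This is a numpy-only toy (gradient descent on a linear model), not anyone's actual pipeline:

```python
# Hypothetical sketch: train on 4/5 of the rows; the last fold is used
# only as an early-stopping signal.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=500)

split = 400  # first 4 "folds" for training, 5th fold held out
X_tr, y_tr = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

w = np.zeros(8)
best_val, best_w, patience = np.inf, w.copy(), 0
for epoch in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(X_tr)  # MSE gradient
    w -= 0.1 * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val - 1e-8:
        best_val, best_w, patience = val_loss, w.copy(), 0
    else:
        patience += 1
        if patience >= 10:  # validation loss plateaued: stop
            break
```

The trade-off raised in the question is real: retraining on all the data uses more rows but removes the stopping signal, which is why people often keep a held-out fold (or ensemble the fold models) instead.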
Oh man, this is a dead simple random forest!
That’s the funny part of it. Apparently you can get this close to the top with an RF.
NNs might give you a boost in MMC, but in raw accuracy (CORR) an RF is hard to beat.
Yeah, I was also wondering why the diagnostics are pretty much useless for comparing models. They're a good sanity check, but not much more. Could it be that some people have leakage into the validation set, and thus skew the diagnostics ranks toward the top?
At the beginning, I did use the validation set in CV, and even though the model never trained on it, this still led to massive overfitting: all my metrics were light green (99+), but the live results were absolutely terrible (Numerai before round 274).
Everybody had bad rounds before R274: https://dashboard.numeraipayouts.com/ The median score was pretty much negative. You need to judge your model's performance on live data relative to other models, not its performance in isolation.
I agree, and this is what I'm doing currently. However, over 4 rounds my model was much, much worse than other models. Maybe I should have collected more data, but being below the 20th percentile in 2 of 4 rounds is enough for me to say that it is "not optimal, probably".
The diagnostics for this model are appalling, but it's a decent model: it peaked at #4 for MMC and around #750 for correlation. A consistent performer, more green than red, with 19 medals in all.