Diagnostics for #39

nyuton · September 19, 2021, 8:40am

Diagnostics are built to help us, but they can be very misleading.

My most successful model, which stands at #39 at the time of writing (nyuton_test8), has such bad diagnostics that I almost threw it out at the beginning, and I didn’t start staking on it until recently.

Trust your CV!
I’m writing this post partly for myself, as a reminder for when I see similar results with the new dataset.
To be frank, this model is an ensemble of models from CV folds, and its training data also includes the validation set. The diagnostics here are for the only part of the ensemble that doesn’t include validation data at all. They are probably a good approximation of the other models as well.
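For readers who have not built one, here is a minimal sketch of a fold ensemble: one model per CV fold, each trained on the data outside its fold, with predictions averaged at the end. It is not the actual nyuton_test8 pipeline. Splitting by contiguous era blocks is an assumption, and `era_folds`, `fit_fold_ensemble`, `ensemble_predict` and the `fit` callback are hypothetical names; in practice `fit` could wrap a random forest or any other learner.

```python
from statistics import fmean


def era_folds(eras, n_folds):
    """Split the sorted unique eras into n_folds contiguous blocks."""
    unique = sorted(set(eras))
    size, extra = divmod(len(unique), n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + size + (1 if i < extra else 0)
        folds.append(unique[start:stop])
        start = stop
    return folds


def fit_fold_ensemble(rows, eras, targets, n_folds, fit):
    """Fit one model per fold, each on the rows whose era is outside that fold.

    `fit(X, y)` is a hypothetical callback that must return a callable
    `predict(row) -> float`.
    """
    models = []
    for held_out in era_folds(eras, n_folds):
        held = set(held_out)
        keep = [i for i, era in enumerate(eras) if era not in held]
        models.append(fit([rows[i] for i in keep], [targets[i] for i in keep]))
    return models


def ensemble_predict(models, row):
    """Average the predictions of all fold models for one row."""
    return fmean(model(row) for model in models)


def fit_mean(X, y):
    """Toy stand-in learner: always predicts the mean training target."""
    mean_target = fmean(y)
    return lambda row: mean_target


# Tiny usage example with made-up data.
eras = [1, 1, 2, 2, 3, 3, 4, 5]
targets = [0, 0, 1, 1, 2, 2, 3, 4]
rows = [[float(t)] for t in targets]
models = fit_fold_ensemble(rows, eras, targets, n_folds=3, fit=fit_mean)
prediction = ensemble_predict(models, rows[0])
```

Each held-out block can also be scored with the model that did not see it, which gives a CV estimate that does not depend on the provided validation set.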

Another model (nyuton_test15) looks even worse than that. And it got 14 medals in its first 9 completed rounds…

restrading · September 19, 2021, 9:42am

Definitely a great reminder, trust your CV indeed. Although I’d still keep an eye on risk, because getting medals and a high rank may also be the result of a high-variance model, and those tend to do worse when the regime changes.

sunkay · September 19, 2021, 11:16am

Great reminder! But what is #39?

qeintelligence · September 19, 2021, 11:53am

The other way around is also definitely the case. I had a NN with ‘reasonable’ diagnostics (everything green) which also performed well for 5-6 rounds; after that it got burned for 4 rounds, only to recover again now. And no, I didn’t use val for training (but I suspect there is leakage, because I used val to determine the number of epochs).

nyuton · September 19, 2021, 12:36pm

Ranks 39 on the leaderboard.

neosbrother · September 19, 2021, 1:45pm

So your CV results looked very good but it did poorly on the full diagnostics?

nyuton · September 19, 2021, 5:27pm

Yes, CV was great! Validation score was not so great!

yxbot · September 20, 2021, 9:46pm

My most profitable model, which has been running for 4 months with a top-100 3M return, also has a rubbish diagnostics score. I was able to pick it out for staking because it seems to have by far the most stable daily score. Additionally, after I settled on my validation scheme, it ranked 4th among my models on non-corr metrics, i.e. ratio-based metrics.

So yes, people should try to create validation methods that are at least not entirely dependent on the provided validation data.
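As an illustration of a check that does not depend on the provided validation data, the sketch below scores out-of-fold predictions era by era and reports a mean/std ratio next to the mean. It assumes rank (Spearman) correlation per era and a Sharpe-like ratio; these are illustrative choices, not necessarily the metrics used above, and all function names are hypothetical.

```python
from statistics import fmean, stdev


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def per_era_corr(eras, preds, targets):
    """Rank correlation between predictions and targets within each era."""
    by_era = {}
    for era, pred, target in zip(eras, preds, targets):
        era_preds, era_targets = by_era.setdefault(era, ([], []))
        era_preds.append(pred)
        era_targets.append(target)
    return {
        era: pearson(ranks(p), ranks(t))
        for era, (p, t) in sorted(by_era.items())
    }


def summarize(era_scores):
    """Mean, sample std, and mean/std ratio of the per-era scores."""
    scores = list(era_scores.values())
    mean, std = fmean(scores), stdev(scores)
    return {"mean": mean, "std": std, "sharpe": mean / std}


# Made-up example: perfect ordering in era 1, reversed ordering in era 2.
eras = [1, 1, 1, 2, 2, 2]
preds = [0.1, 0.2, 0.3, 0.3, 0.2, 0.1]
targets = [0.0, 0.5, 1.0, 0.0, 0.5, 1.0]
scores = per_era_corr(eras, preds, targets)
summary = summarize(scores)
```

A model with a slightly lower mean but a much smaller spread across eras can come out ahead on the ratio, which is one way to capture the "stable score" property.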

profricecake · September 22, 2021, 2:21am

Out of curiosity, what kind of mean correlation numbers do you see in your CV runs?

Thanks

nyuton · September 22, 2021, 6:20am

0.043 when trained on train+validation

profricecake · September 22, 2021, 2:32pm

Thank you! Very helpful. I’ve been leaning on validation metrics for so long I didn’t know what to expect from a CV score!

profricecake · September 22, 2021, 9:34pm

One other question: When you find a good set of hyperparameters using CV on some number of folds of the combined train+validation data, do you then train your final model (the one you use for tournament prediction) by using those HPs and all the available data (instead of folds)? I’m assuming not because then you wouldn’t have a sane measure of when to stop the training. But if not, then do you just stick with training on, say, 4 of 5 folds and validating on the 5th?

Thanks again.

nyuton · September 23, 2021, 6:59am

Oh man, this is a dead simple random forest!
That’s the funny part of it. Apparently you can get this close to the top with an RF.
NNs might give you a boost in MMC. But in raw accuracy (CORR), RF is hard to beat.

kenfus · September 23, 2021, 9:37am

Yeah, I was also wondering why the diagnostics are pretty much useless for comparing models. They’re a good sanity check, but not much more. Could it be that some people have leakage from the validation set, and thus skew the ranks in the diagnostics toward the top?
At the beginning, I did use the validation set in CV, and even though the model never trained on the validation set, this still led to a massive overfit: all my metrics were light green (99+), but the live results were absolutely terrible (Numerai before round 274).

themicon · September 23, 2021, 10:08am

Everybody had bad rounds before R274: https://dashboard.numeraipayouts.com/ The median score was pretty much negative. You need to judge the relative performance of your model on live data, not its performance in isolation.

kenfus · September 23, 2021, 10:29am

I agree, and this is what I’m doing currently. However, over 4 rounds my model was much, much worse than other models. Maybe I should have generated more data, but being below the 20th percentile in 2 of 4 rounds is enough for me to say that it is “not optimal, probably”.

nyuton · September 23, 2021, 10:59am

That doesn’t say much at all; you need to test longer. The above-mentioned model has also had more than 2 rounds in the lower 20th percentile…
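A rough back-of-the-envelope check of why 2 bad rounds out of 4 is weak evidence. It assumes, purely for illustration, that a model's percentile in each round is an independent uniform draw, so it lands in the bottom 20% with probability 0.2. Real rounds overlap in time and are correlated, so treat this as a sketch, not a proper test. `prob_at_least` is a hypothetical helper.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that a model lands below the 20th percentile in 2 or more of 4 rounds.
print(round(prob_at_least(2, 4, 0.2), 4))  # 0.1808
```

Under these assumptions, roughly 18% of models with no persistent disadvantage would show this pattern by chance alone.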

factorsparsity · September 25, 2021, 8:20pm

Is this for the old data format (310 features) or the new one?

BTW: I can corroborate your initial statement on the diagnostics. They don’t always indicate whether a model is good or not.

johnnywhippet · September 27, 2021, 3:49pm

The diagnostics for this model are appalling but… it’s a decent model. It peaked at 4 for MMC and 750 or so for correlation. A consistent performer, more green than red. 19 medals in all.

johnnywhippet · September 29, 2021, 6:39pm

Those appalling diagnostics have garnered two more silver medals. I don’t get it.
