(Way too early) Comparison of legacy & new models

Hello all:

How are you guys doing with your new models trained on the super massive dataset?
Or are you diamond-handing your good old legacy models?

Now that the new dataset has been around for almost 4 months, and some of the new models have up to 10 resolved rounds under their belt, I thought it would be interesting to make an overall comparison.

Here is the comparison view of all of my legacy models vs. new models. The first two plots are corr, and the last two are corr percentile.

corr (upper/lower: legacy/new)

corr percentile (upper/lower: legacy/new)

To be clear: I think it is far too early to draw conclusions about any model’s performance until it has more than 20 resolved rounds, but I still find this interesting :slight_smile:

What stands out for me so far is the divergence in model performance. My legacy models tend to go up and down together; some are more stable than others, but they more or less bundle together. The new models, however, seem to behave quite differently in this respect. For instance, in round 289 my new models more or less covered the whole spectrum; I have not seen performance spread quite so widely in my legacy bunch…

I am using more or less the same data pre-processing steps, similar algorithms, and a validation setup that is not very different. My guess is that the wider choice of features and the newly available alternative targets are contributing quite heavily to this divergence.
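For anyone who wants to put a number on that spread, here is a minimal sketch that compares the per-round standard deviation of corr across a group of legacy models and a group of new models. The scores, model names, and the `legacy_`/`new_` prefixes are all made up for illustration; they are not my actual results.

```python
from statistics import pstdev

# Made-up corr scores for illustration only: {round: {model_name: corr}}.
scores = {
    287: {"legacy_a": 0.020, "legacy_b": 0.025, "new_a": 0.050, "new_b": -0.010},
    288: {"legacy_a": -0.010, "legacy_b": -0.005, "new_a": -0.040, "new_b": 0.030},
    289: {"legacy_a": 0.030, "legacy_b": 0.028, "new_a": 0.010, "new_b": 0.060},
}


def spread_by_group(round_scores, prefix):
    """Population standard deviation of corr across models whose name starts with prefix."""
    values = [corr for name, corr in round_scores.items() if name.startswith(prefix)]
    return pstdev(values)


# Per-round spread for each group; a larger value means the models diverged more.
spread = {
    rnd: {
        "legacy": spread_by_group(round_scores, "legacy_"),
        "new": spread_by_group(round_scores, "new_"),
    }
    for rnd, round_scores in scores.items()
}

for rnd in sorted(spread):
    print(rnd, f"legacy={spread[rnd]['legacy']:.4f}", f"new={spread[rnd]['new']:.4f}")
```

With so few rounds the numbers are noisy, so this is only a way to track the pattern over time, not to settle it.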

Are you guys seeing the same phenomenon?

May the burn be with you! :smiley:


Could the difference in performance (legacy vs new) be due to the difference in targets? 20D1L vs 20D2L nomi?

Lol this was not how @arbitrage used to say it


I thought both legacy and new models are 20D1L up till now?

If I understand correctly, the training target for the new data is already 20D2L, but for now we are still scored on 20D1L, and that changes soon

I’m quite happy with the new data and have been phasing out my legacy models; the last of those I submitted 3 weeks or so ago.

Part of that is due to the size of the new data forcing me to rethink my approach, so with round 281 I took advantage of the extension to 50 models from 30 and introduced a somewhat different algorithm as well. The cumulative results are shown below:

All but one are grey; that’s just because of the way Numerai has the colour tables set up. I have 2 parameters that govern this model: one can take one of 4 discrete integer values, the other can take 5, which results in the clustering of the tracks. These are GammaRat 31 through 50.
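The arithmetic behind those 20 model slots is just a 4 x 5 grid. A rough sketch, where the parameter names, the values, and the order in which combinations map to slots are all assumptions for illustration (the post only gives the counts):

```python
from itertools import product

# Hypothetical values; only the counts (4 and 5) come from the post.
param_a_values = [1, 2, 3, 4]
param_b_values = [1, 2, 3, 4, 5]

# Every combination of the two parameters: 4 * 5 = 20 settings.
grid = list(product(param_a_values, param_b_values))

# Assign each combination to a slot, GammaRat 31 through 50.
# The slot order is an assumption, not the actual assignment.
slots = {f"GammaRat {31 + i}": combo for i, combo in enumerate(grid)}

print(len(grid))
```

Models that share one parameter value would be expected to track each other more closely, which is consistent with the clustering in the plot.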

Around round 285 (iirc, I don’t really keep track) I took the model above and introduced a new parameter, and replaced GammaRat 11 to 30 with that. (Those had previously used the legacy model). About half have really improved, and the rest not. But that’s ok, because it’s giving me a decent view of how these parameters interact. So I’m using that info to once again redesign the underlying algorithm. Fun and games, to be sure :laughing:

GammaRat 1 to 10, my last legacy models, got dropped three weeks ago and replaced with a similar algorithm to the ones above. They don’t look great, but right now it’s too early to say for sure.

or shall I say “may the burn be against you” :joy:

“May the burn be in your favor™” :slight_smile:


oh haha I see, appropriately trademarked

nice, interesting to see how this developed.
I haven’t fully replicated all my modelling methods on the new dataset yet - with the newer data update coming, probably I will do more after December.

Nevertheless, after a few tough rounds, most of the legacy models seem to have recovered - some of them never suffered in the first place - so I am just happy seeing them running. I would definitely keep most of my new models for at least 20 rounds to see how they play out over the longer term :slight_smile:

I think a possibly underrated subject of the new data is all the new targets. I think they’ve been maybe more helpful to me than the new data itself.


Do you have the cumulative scores plotted? I can’t make much out of these plots.

My experience is that my legacy workflow has beaten a similar workflow on the new data. I don’t have all the comparison data together to show, but yesterday my legacy model returned 4.8% and was in the 95th percentile on corr and mmc. My co-modeler burned -0.4% and was in the 40th percentile on corr and mmc. I’m not excited about being forced down the new data route.

No not yet, good idea though, I will add that when I come to further work in my dashboard
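In case it helps anyone else, a cumulative view is just a running sum of the per-round scores. A minimal sketch with made-up numbers (the group names and corr values are hypothetical):

```python
from itertools import accumulate

# Made-up per-round corr, in round order, for illustration only.
round_corr = {
    "legacy": [0.02, -0.01, 0.03, 0.01],
    "new": [0.05, -0.04, 0.01, 0.02],
}

# Running sum per model: entry i is the total corr over rounds 0..i.
cumulative = {name: list(accumulate(values)) for name, values in round_corr.items()}

for name, series in cumulative.items():
    print(name, [round(x, 4) for x in series])
```

Each list can then be handed to whatever plotting library the dashboard already uses.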

Yes, that is one of my observations at the moment. From my point of view, some of the new targets seem more volatile than Nomi

Not sure how much they help though, probably need a longer runway to see. For now my legacy models are outperforming the new ones simply by being more stable

That is my observation from recent rounds too, although I think more rounds are needed to draw any conclusions.

I don’t like being pushed to use the new dataset either. Have you tried the 300+ features they said are closely related to the legacy features?

I have not tried only using those. I didn’t realize that there was a list of features that were “closely related” to legacy features. I remember the question being asked about which features are the old features and the answer being that none are the same because of the timing differences. Is that list published or maybe I just missed it in the original announcement?

Needs fact checking but from memory the tournament 3 month avg was around 15-20% when the new data came out and is now at 7.7%. I was around 25% when it came out and now I’m at 49%. I’m assuming that the meta model is currently dominated by models on the new data set, while I have been playing it safe and sticking with my legacy model while I see how the new data performs.

Green is staked legacy, Orange is unstaked legacy experiments, Cyan is unstaked super massive.


from the team’s October Updates
under New Feature Metadata - the “legacy” set

Legacy: 304 of the original 310 features that were carried over to the new dataset. You can use this set to achieve nearly the same model as the legacy data.
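If the feature metadata lists the set by name, selecting columns by name avoids guessing at their position. A sketch under the assumption that the metadata is JSON with a `feature_sets` mapping containing a `legacy` key; the layout, the feature names, and the column order below are made up for illustration and should be checked against the actual file:

```python
import json

# Made-up stand-in for the feature metadata file (assumed layout).
metadata_text = json.dumps(
    {"feature_sets": {"legacy": ["feature_b", "feature_d"], "small": ["feature_a"]}}
)
metadata = json.loads(metadata_text)
legacy_set = set(metadata["feature_sets"]["legacy"])

# Made-up column order of a training file.
all_columns = ["era", "feature_a", "feature_b", "feature_c", "feature_d", "target"]

# Select by name, keeping the file's column order.
legacy_columns = [c for c in all_columns if c in legacy_set]

# Select by position: the first N feature columns, for comparison.
feature_columns = [c for c in all_columns if c.startswith("feature_")]
first_n = feature_columns[: len(legacy_set)]

print(legacy_columns)
print(first_n)
```

In this toy layout the two selections differ, which says nothing about the real column order; it only shows why reading the named set is the safer route.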

Are we to assume that it is the first 304 features?