V5 "Atlas" Data Release

V5 “Atlas” Data is here. Don’t worry, nothing is changing right now. You cannot submit on v5 yet. Live data will be released in September.

This release expands and improves the universe we use to craft our dataset, evolving our features and targets to be more predictive. Because the features and targets have changed significantly, you will need to retrain your models: any model trained on v4.x data will soon be obsolete. The new data offers more rows (and therefore more diversification) in each era, higher correlation, and lower variance. Our research shows that most models will simply have higher CORR when retrained on this data.

For example, when we compare the performance of our hello_numerai tutorial notebook running on v4.3 vs. v5, we see a drastic improvement in performance with no change to the underlying model:

CORR            v4.3     v5.0
mean            0.0245   0.0293
std             0.021    0.021
sharpe          1.1379   1.335
max drawdown    0.0637   0.0329
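
For reference, these summary stats are simple functions of the per-era correlation series. A minimal sketch, assuming the usual conventions (Sharpe as mean over std of the era correlations, max drawdown on their cumulative sum; the era_corrs values below are made up):

  import pandas as pd

  # Per-era correlations between predictions and the target
  # (made-up values; in practice there is one per validation era).
  era_corrs = pd.Series([0.030, 0.010, -0.020, 0.040, 0.025])

  mean = era_corrs.mean()
  std = era_corrs.std()
  sharpe = mean / std
  # Drawdown is measured on the cumulative sum of per-era correlations.
  cumulative = era_corrs.cumsum()
  max_drawdown = (cumulative.cummax() - cumulative).max()
  print(f"mean {mean:.4f}  std {std:.4f}  sharpe {sharpe:.4f}  max drawdown {max_drawdown:.4f}")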

Furthermore, since our hedge fund is now trading a different universe than v4, we must have predictions on all of the new stocks which aren’t in the old data versions. We had hoped that old models would be able to predict the new universe just as well, but our internal research instead found a steep decline in performance when v4.3 models attempt to predict on v5.

It’s clear that the new universe forces a breaking change, but we are aiming to give everyone plenty of time to re-train their models on the new dataset. For now, we have only released the v5 training and validation datasets so you can start exploring. Here is what is currently available in the API:

  "v5.0/features.json",
  "v5.0/train.parquet",
  "v5.0/train_benchmark_models.parquet",
  "v5.0/validation.parquet",
  "v5.0/validation_benchmark_models.parquet",
  "v5.0/validation_example_preds.csv",
  "v5.0/validation_example_preds.parquet"

Here is the current roadmap for this data release:

July 19

  • Support v5 data in diagnostics

September 13

  • Release v5 live data:

    "v5.0/live.parquet"
    "v5.0/live_benchmark_models.parquet"
    "v5.0/live_example_preds.csv"
    "v5.0/live_example_preds.parquet"
    
  • Support v5 data in Model Uploads

  • Start accepting v5 submissions (they will not be scored yet)

September 17

  • Update example scripts and tutorial notebooks
  • Start submitting benchmark models to the website

September 27

  • Change all scores and payouts to v5
  • Stop supporting v4 data
  • Stop accepting v4 submissions

FAQ
What’s different about v5?
The universe (the list of stocks we are willing to trade) has changed. This means features and targets have also changed due to the nature of ranked residual returns. Because of this, we have changed the feature names. In our research, we have found that models trained on v5 data are significantly better at predicting targets than v4 models.

Can I still train and submit on v4 data?
Yes. All v4 data (v4 through v4.3) is still supported for the next two months. You can train and predict on all v4.x data until September 27. After this date, v4.x data will no longer be updated and you will no longer be able to use it for live prediction.

What will happen to my models?
Nothing for now. No breaking changes will happen until September 27. On September 27, if your model is still submitting on v4, it will fail to submit.

Where is the live data?
It’s coming in September. Live data is not publicly available because you will not be able to submit predictions on v5 until September.

Can I use v4.3 models on v5 data?
No. V5 changes the universe of stocks used to craft the dataset, thus changing IDs, features, and targets. Our research has shown that v4 models are extremely bad at predicting v5 - do not use v4 models to predict v5.

Where are all of the targets?
Upon release, the datasets only have 4 targets: cyrus_20, cyrus_60, teager_20, and teager_60. After some more research and development, we will include the following targets in the v5 datasets:

  • Ralph
  • Victor
  • Tyler
  • Waldo
  • Alpha
  • Bravo
  • Charlie
  • Delta
  • Echo
  • Jeremy
  • Teager
  • Cyrus
  • Caroline
  • Sam
  • Xerxes

FYI, if you take a look at our example_model, its performance on v4.3 validation data is as follows:
[screenshot: example_model performance on v4.3 validation]
And after updating to use v5 data, its performance measurably improves:
[screenshot: example_model performance on v5 validation]

I think the images are duplicated @numerark


I respect your dedication to keeping Numerai up-to-date, however… in this post from a mere 6 months ago, Midnight Data Release, you stated:
"
We’ve been moving at a crazy pace for the last year or so, releasing new data or targets or payouts every couple of months, and we know some of you have whiplash trying to keep up.

There are no other target or data releases on our immediate calendar.
"
Followed, of course, by the caveat that you can’t guarantee anything…

Now this ‘promise’, if I can call it that, isn’t even mentioned here, and we’re now looking at a forced transition within a two-week period in September? What if you happen to not have any time available during those two weeks?
Long story short, this certainly doesn’t do much good for my confidence in Numerai.


You’re right, just updated the image, thanks for pointing it out.


Sorry you feel that this degrades your confidence in Numerai. Our research has shown that the new universe in the v5 data is a significant improvement over the v4.x datasets and that the universes don’t mix well between data versions. This is why we must make a hard cutoff of v4.x data.


As said, I understand you need to set up the data in such a way that maximizes the fund’s performance. On the other hand, it’s not clear to me how data scientists will benefit from this, even though they are the ones who will have to make the effort to at least retrain their models, and potentially do more than that, to get good performance on the new data.
It’s quite possible that I don’t fully understand the relation between the fund doing well and the benefit to data scientists. Maybe you could elaborate on this? I could, for example, imagine that the fund would buy Numeraire when doing well and thus support data scientists by driving up the token’s price, but I haven’t read anything of the kind. If there’s no benefit for data scientists, then data updates are just additional work, and something like a predictable data update schedule would be quite desirable imho.

Same features, renamed and rescaled. And it does model better. It’d be nice if we could upload before the 17th.


I agree with this. I’d be happy to beta test v5.0 on live from Aug 17 @numerark.


Those who are using Kaggle can now access a public dataset with V5 data: numerai latest tournament data (kaggle.com). It will be automatically updated about 15 minutes after the Saturday round opens (via webhook). Here is the public Kaggle notebook producing this dataset: numerai data (kaggle.com).

No need to download the data over and over again. Just attach the dataset or notebook as your input and it will be available for your experiments.
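
Once attached, reading it is just a parquet load. A minimal sketch; the directory name under /kaggle/input is a placeholder, so check the actual slug of the attached dataset:

  import pandas as pd

  # Placeholder path -- the real folder name depends on the dataset slug.
  DATA_DIR = "/kaggle/input/numerai-latest-tournament-data"

  train = pd.read_parquet(f"{DATA_DIR}/train.parquet")
  validation = pd.read_parquet(f"{DATA_DIR}/validation.parquet")
  print(train.shape, validation.shape)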


What exactly are the changes to features and targets? How many more? What values? Could you give a summary here please, so we don’t have to download gigabytes of data and search through it? Also, how many instances (rows) are in the training and validation sets?
By how much is the overall volume of data growing again?

There are more rows in every era – picking a random recent era, it’s got 1487 more rows than the equivalent era in the v4.3 dataset. (For older eras the increase doesn’t seem as great.)

That’s the bulk of the change. There are no new features. However, since all feature and target values come in buckets with a specific fixed distribution, the addition of new rows means that a certain number of values in each row are going to change. So the row representing the same stock in the same era, built from the same underlying features, will nevertheless have different feature and target values in each dataset. (More accurately: there are more rows, and all the old rows have changed as a result.)
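
If you want to check this yourself, comparing per-era row counts is cheap since you only need the era column. A sketch, assuming both files are already downloaded (file names from memory, so verify against napi.list_datasets()):

  import pandas as pd

  # Load only the era column from each dataset version.
  v43 = pd.read_parquet("v4.3/train_int8.parquet", columns=["era"])
  v5 = pd.read_parquet("v5.0/train.parquet", columns=["era"])

  counts = pd.DataFrame({
      "v4.3": v43.groupby("era").size(),
      "v5.0": v5.groupby("era").size(),
  })
  counts["extra_rows"] = counts["v5.0"] - counts["v4.3"]
  print(counts.tail())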

The other part of the change (so far) is they’ve only got cyrus & teager targets included (20d & 60d), but they are promising to add back at least some of the targets we had before.

And for some reason (as of this writing) the feature column names have all changed, so we aren’t quite sure whether the old column order (with the “same” features) is preserved or not. The data also seems to be in an inconsistent state right now as I write this, but I’m sure that will get worked out in a few days. So unless you’re in a hurry, I’d watch the Discord (and this thread) and give it at least a few days to see if it stabilizes; I’m hoping we’ll get some of those targets back sooner rather than later.


Thanks for that. That should be backwards compatible with my way of processing. I do not need any more targets. The only regret is that I will still need to download the whole lot every week.

The performance difference is great to see. The v5 data contains new stocks that aren’t in v4.3, right? Is there any data on how the v4.3 stocks do in the v5 data? I’m wondering if the improvement is due to the added stocks being more predictable or if the improvement is across the board.

@numerark Thanks for the update. A few questions:

  • When will you add all the above-mentioned targets?
  • Is the dataset fully ready to train new models? I see the targets follow a new naming convention (‘target_cyrusd_20’, ‘target_cyrusd_60’, ‘target_teager2b_20’, ‘target_teager2b_60’). Based on previous datasets, the name was like ‘target_cyrus_v4_20’.
  • Can you provide some details on your criteria for including or excluding targets? For example, you include Jeremy (instead of Jerome). I remember Jerome outperforming at one point.
  • What is the difference in features (or engineering of new features) between v4.3 and v5.0 to get that additional boost in performance?

V5 Update: Jul 29, 2024

We have released more targets into the v5 dataset.
The following targets are available now:

  • ralph
  • victor
  • tyler
  • waldo
  • alpha
  • bravo
  • charlie
  • delta
  • echo
  • jeremy
  • teager2b
  • cyrusd
  • caroline
  • sam
  • xerxes
  • rowan
  • agnes
  • claudia

FAQs

Is the dataset ready now?
Yes. We don’t plan on making any fundamental changes to the data. You should begin training on it as soon as possible.
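
If you’re starting from scratch, a minimal training sketch in the spirit of the hello_numerai notebook might look like the following. The "feature_sets"/"small" keys and the "target" column follow the v4 metadata layout, so inspect v5.0/features.json and the parquet columns to confirm they carry over:

  import json
  import pandas as pd
  import lightgbm as lgb

  # Assumes the v5.0 files have already been downloaded via numerapi.
  with open("v5.0/features.json") as f:
      feature_metadata = json.load(f)
  features = feature_metadata["feature_sets"]["small"]

  train = pd.read_parquet("v5.0/train.parquet",
                          columns=["era"] + features + ["target"])

  # Hyperparameters roughly follow the hello_numerai example model.
  model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.01,
                            max_depth=5, num_leaves=2**5 - 1,
                            colsample_bytree=0.1)
  model.fit(train[features], train["target"])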

Why these targets?
We decided to only add targets that were most pertinent to this dataset and were easiest to release with minimal research. Some of the targets are upgraded versions of older targets; any targets not included in this release are either too old or have not been researched enough to ensure predictive quality.


First off, I share your sentiment, I really do. Some of my best models will be dead after the V5 transition, because they were meticulously crafted and pre-trained on very specific sets of eras and with very specific seeds. To do all of this again takes a lot of time and resources (which is not very feasible with the current payout structure), and might not even be possible anymore (due to much lower drawdowns in the new data).

But here’s the reality check: there is a big difference between what’s good for the tournament and what’s good for Numerai’s (benchmark) models. And you can guess where I’m heading.

V5 does look better on paper. On the other hand, is it fair toward us, participants, to nuke everything that we built before V5 and on short notice? Probably not, but hey, this is the game: if you have to rely on Numerai, then you accept the rules of the game, and “the game” comes with all the broken promises and breaking changes. If you don’t like it, you can sail on your own, or try something that breaks less often - like Signals.

There is an upside to it though: new blood will come and might do something great with it. Or not. It actually doesn’t matter, as long as the benchmarks can keep the fund afloat. And, at the end of the day for this business, that’s the only thing that matters. :slightly_smiling_face:


@numerologist, well put. The “if you don’t like it, feel free to leave” argument always works, of course, and is fair enough. I am fairly new to the tournament, so I wasn’t really aware of the “rules of the game”, and this is an introduction to working with a financial institution for me: “broken promises and breaking changes”. I should have known :stuck_out_tongue:

@numerark Why did you remove the v43_to_v5_map?
I didn’t see any indication that my old V4.3 models didn’t work on V5 data. They actually get much better validation results on V5, probably because V5 seems much easier to predict.

Additionally, when manually looking at the last few weeks’ worth of live rounds, the features of the intersecting IDs don’t seem to change very much (apart from a few 1-token changes at the fringes, e.g. a “1” becoming a “2”), but I didn’t analyze it much TBH.

Here are some validation results of a few of my models:
p_test_a: an old model originally trained on v4.3, with an adapter that can accept v4.3 or v5.0 feature names and will map v5.0->v4.3 if needed. (I call this “454”, because it actually maps v4.3->v5.0->v4.3, where the first transformation is a no-op when taking v5.0 data in; that way the model can take either.) The corr graph looks similar, and the performance on v5.0 is much better, even though nothing besides mapping feature names changed (0.0337 corr / 1.8104 sharpe on v5.0 vs 0.0303 corr / 1.4699 sharpe on v4.3).

hxxps://photos.app.goo.gl/oLUkzw5pUnLYrD699
hxxps://photos.app.goo.gl/HJ2ii5EZA8SvBkHdA
(Can’t post images or links :roll_eyes:)
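
Conceptually, the adapter is just a column rename. A rough sketch, assuming you kept a copy of the removed map and that it was a JSON dict keyed by v4.3 feature names, which may not match its actual format:

  import json
  import pandas as pd

  # Hypothetical format: {"v4.3 feature name": "v5.0 feature name", ...}
  with open("v43_to_v5_map.json") as f:
      v43_to_v5 = json.load(f)
  v5_to_v43 = {v5_name: v43_name for v43_name, v5_name in v43_to_v5.items()}

  def adapt_454(df: pd.DataFrame) -> pd.DataFrame:
      # Rename v5.0 feature columns back to v4.3 so an old model can predict;
      # columns already in v4.3 naming pass through untouched.
      return df.rename(columns=v5_to_v43)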

p_test_j: a new model that I trained on v5.0 data, with a “45” adapter, i.e. it maps v4.3->v5.0 if called with v4.3 data and takes v5.0 as-is. This works fine on v4.3 data, even though it was trained on v5.0 (0.0367 corr / 1.7474 sharpe on v5.0, 0.0327 corr / 1.5078 sharpe on v4.3).

hxxps://photos.app.goo.gl/xbPNpZ82Q7Y7MDKf6
hxxps://photos.app.goo.gl/MUSXfqbKVa5Av41y5

As both of these use benchmark model ensembling, here is also a v4.3 model that doesn’t use benchmark models:

p_test_c: 0.0272 corr / 1.5576 sharpe on v5.0, 0.0264 corr / 1.3589 sharpe on v4.3. This model is obviously very bad, but again it seems to work fine on v5.0, still with better validation results.

hxxps://photos.app.goo.gl/Sp6qtdrqU3RHEQjc7
hxxps://photos.app.goo.gl/rDkP6SKu2sqS9HUg6

Am I missing something?

The v5.0 launch is honestly very stressful for me. First, you indicated that we’d have to scrap everything and start from scratch because of the feature renaming. Then I saw that you added the v43 feature map, and was relieved that I could at least still run my best v4.3 models after writing the adapter code. Now you’ve removed the map and I’m back at square one. All that in an extremely tight timeframe.


Just to be explicit here: v5 submissions will be possible from Sept 13 onwards. Will they be scored? Honestly, reading the timeline here, it seems that no scoring is enabled until Sept 27? So we have to upgrade to our v5 models without knowing how they score? This seems sub-optimal.
