Super Massive Data Release: Deep Dive

We added an old data download/upload version back to the website


Sorry about that. I’ve made some updates to it since the original drop; if you pull the latest changes, it should work. You can reach out to me on RocketChat if you keep having issues.

No you don’t. If you look under the “Download Data” and “Upload Predictions” buttons, you’ll see a link that says “legacy data”. Click on that and you’ll get the buttons for the old data downloads and uploads. (Thanks @master_key!)


I’ve been just about coping with the old format on 32GB of RAM, but it seems I need an upgrade. Is feeding this to a GPU (12GB?) in chunks practical?

I’m trying to submit and getting “PermissionError: Forbidden”.
Same thing when running diagnostics on a validation set that worked yesterday!

Just started working! OK now!

I ran a partial stress test on it yesterday. Set downsampling to 1 and n_estimators to 40k. I let it get into the second split of the CV before killing it. No issues up to that point. Where did it fail?

Deep dives come with risk of drowning…
Hopefully all will manage to surface with good results!


Hi, can anyone tell me how to download the new data with numerapi? I can confirm that my legacy code continues working by downloading the legacy data, but what changes do I need to make to the numerapi code to get the new data into my pipeline? I didn’t see it mentioned anywhere yet, including the numerapi docs. Thanks!

EDIT: I see how this is explained in the github repo, for anyone else looking.


Parquet format suggestions / issues:

a) There should be multiple small partitions instead of a single huge partition. Presently you have to load the entire set into memory to do something with it. Smaller partitions will allow streaming data processing - useful for many scenarios such as simple transforms or serving data in chunks.

b) The Thrift schema in the parquet files shows the “target…” columns as nullable single-precision (32-bit) floats, but the data actually contains nullable double-precision values for those columns. This probably doesn’t show up in Python (which is not statically typed), but it does cause problems in other languages/platforms. Ideally the data should conform to the schema.


Or you could write a simple routine to break the parquet data up into whatever format is most convenient for your routines and save the output locally? That’s what I do, and I break it into separate, directly loadable files listed by era, with the Id column kept in a separate file for each data_type. When it comes to processing, I then can load only the era data I’m looking for, and if it’s already in the appropriate binary format, one doesn’t have to slow down for translation.

Getting it broken down quickly is a function of using memory efficiently and reducing file calls. So right now I take in ~50 feature columns at a time to assemble 200 eras, and save those separately. On my home box that takes about 30 minutes for the full data set. I can probably reduce that further (it would have taken multiple hours doing one feature column and one era at a time, for example), but this will do for now.

I find the Parquet files much easier to work with than CSV.


The new targets are regularized in different ways and exhibit a range of correlations with each other from around ~0.3 to ~0.9. Due to this regularization you may find that models trained on some of the new targets generalize to predict “target” better than models trained on “target”. Other targets may yield models that appear to generalize poorly to “target” but end up helping in an ensemble.

Can someone explain the rationale behind this approach? Why would training on a “wrong” target help with the real target? Is this technique specific to finance data (more noise than signal), or is it a general idea in ML?

I think this is more related to the concept of generalization in ML, where an ensemble of uncorrelated models (in this case, the result of training on different targets) reduces variance on out-of-sample data at a slight cost in bias.
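A toy illustration of that idea: fit the same model class on two noisy, correlated targets and average their predictions. All names and data here are made up, not the actual tournament targets.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
signal = X[:, 0]
# two noisy, correlated targets standing in for e.g. target / target_arthur_60
t_main = signal + rng.normal(scale=1.0, size=200)
t_alt = signal + rng.normal(scale=1.0, size=200)

# train one model per target on the first half of the data
m1 = Ridge().fit(X[:100], t_main[:100])
m2 = Ridge().fit(X[:100], t_alt[:100])

# ensemble: average predictions from models trained on different targets
pred = (m1.predict(X[100:]) + m2.predict(X[100:])) / 2
corr = np.corrcoef(pred, t_main[100:])[0, 1]
```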


So how do I download the supermassive dataset via GraphiQL? It seems I should pass the round number for the NEW dataset, like this?


but the resulting data is the old set.

I figured it out:
query { listDatasets }
will give you a list of the file names available.
To get the download link for the legacy data, you do NOT specify a filename:
query { dataset(tournament: 8, round: 281) }
To get the new, supermassive data as a .zip file containing .parquet files:
query { dataset(tournament: 8, round: 281, filename: "") }
To get the new, supermassive data as .csv or the int8 versions, you need to specify which file you want, e.g.:
query { dataset(tournament: 8, round: 281, filename: "numerai_training_data.csv") }


Yeah, but I find CSV files easier to manipulate than Parquet files.

Getting a 403 Client Error: Forbidden when trying to download the ‘old_data_new_val.parquet’ file in either parquet or CSV versions. Anyone have a solution for this?


Is there an estimated timeline for when the legacy dataset will be deprecated?


Is there an estimated timeline for when the legacy dataset will be deprecated?

Same here, I would like to know how much time I have to migrate my code to the new dataset.

I found out that some targets have NaN values; for example, "target_arthur_60" has 20599 NaNs in "numerai_training_data.parquet". Is that normal?

Yes. I don’t remember the exact number offhand though, and I don’t think there were any in the primary target.
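A quick way to check this yourself with pandas: count the missing values across all target columns. The frame below is a toy stand-in for numerai_training_data.parquet, with invented values.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real training data
df = pd.DataFrame({"target": [0.5, 0.25, 0.75],
                   "target_arthur_60": [0.5, np.nan, np.nan]})

# count missing values in each target-prefixed column
nan_counts = df.filter(like="target").isna().sum()
print(nan_counts)
```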