Super Massive Data Release: Deep Dive

master_key · September 11, 2021, 2:35pm

We added an old data download/upload version back to the website

master_key · September 11, 2021, 2:40pm

Sorry about that. I’ve made some updates to it since the original drop; if you pull the latest changes, it should work. You can reach out to me on RocketChat if you keep having issues.

gammarat · September 11, 2021, 3:29pm

No you don’t. If you look under the “Download Data” and “Upload Predictions” buttons, you’ll see a link that says “legacy data”. Click on that and you’ll get the buttons for the old data downloads and uploads. (Thanks @master_key!)

muppetshow · September 11, 2021, 6:37pm

Been just about coping with old format with 32GB ram, but seems I need an upgrade. Is feeding this in chunks to a GPU (12gb?) practical?

bob_watson · September 11, 2021, 6:45pm

I’m trying to submit and getting “PermissionError Forbidden”
Same trying diagnostics on a validation set which worked yesterday!

Just started working! OK now!

objectscience · September 11, 2021, 8:00pm

I ran a partial stress test on it yesterday. Set downsampling to 1 and n_estimators to 40k. I let it get into the second split of the CV before killing it. No issues up to that point. Where did it fail?

donk · September 12, 2021, 5:55am

Deep dives come with risk of drowning…
Hopefully all will manage to surface with good results!

jefferythewind · September 12, 2021, 10:16am

Hi, can anyone tell me how to download the new data with numerapi? I can confirm that my legacy code continues working by downloading the legacy data, but what changes do I need to make to the numerapi code to get the new data in my pipeline? I didn’t see it mentioned anywhere yet or in the numerapi docs. Thanks!

EDIT: I see how this is explained in the github repo, for anyone else looking.

fwaris · September 12, 2021, 5:39pm

Parquet format suggestions / issues:

a) There should be multiple small partitions instead of a single huge partition. Presently you have to load the entire set into memory to do something with it. Smaller partitions will allow streaming data processing - useful for many scenarios such as simple transforms or serving data in chunks.

b) The ‘thrift’ schema in the parquet files declares the “target…” columns as nullable single-precision (32-bit float) values, but the data in those columns is actually nullable double values. This issue probably does not show up in Python (which is not statically typed) but does cause problems in other languages/platforms. Ideally the data should conform to the schema.

gammarat · September 12, 2021, 6:38pm

Or you could write a simple routine to break the parquet data up into whatever format is most convenient for your routines and save the output locally? That’s what I do, and I break it into separate, directly loadable files listed by era, with the Id column kept in a separate file for each data_type. When it comes to processing, I then can load only the era data I’m looking for, and if it’s already in the appropriate binary format, one doesn’t have to slow down for translation.

Getting it broken down quickly is a function of efficient memory use and fewer file calls. So right now I take in ~50 feature columns to assemble 200 eras at a time, and save those separately. On my home box that takes about 30 minutes for the full data set. I can probably reduce that more (it would have been multi-hours doing one feature column to assemble one era at a time, for example), but this does for now.
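A rough sketch of that kind of per-era split, not the actual routine described above: it assumes the data is already in a pandas DataFrame with `id` and `era` columns and feature columns whose names start with `feature`, and it picks `.npy` as the "directly loadable" binary format. The function names and file layout are illustrative only.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def split_by_era(df, out_dir, era_col="era", id_col="id", feature_prefix="feature"):
    """Write one float32 .npy feature matrix per era, plus a separate id file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    feature_cols = [c for c in df.columns if c.startswith(feature_prefix)]

    # Ids (with their era) go in their own file, apart from the feature data.
    df[[id_col, era_col]].to_csv(out_dir / "ids.csv", index=False)

    shapes = {}
    for era, group in df.groupby(era_col, sort=False):
        features = group[feature_cols].to_numpy(dtype=np.float32)
        np.save(out_dir / f"{era}_features.npy", features)
        shapes[era] = features.shape
    return shapes


def load_era(out_dir, era):
    """Load the feature matrix for a single era without touching the others."""
    return np.load(Path(out_dir) / f"{era}_features.npy")
```

Combined with reading a subset of columns at a time, this keeps peak memory bounded by the chunk being assembled rather than the full data set.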

I find the Parquet files much easier to work with than CSV.

taori · September 13, 2021, 7:36pm

The new targets are regularized in different ways and exhibit a range of correlations with each other from around ~0.3 to ~0.9. Due to this regularization you may find that models trained on some of the new targets generalize to predict “target” better than models trained on “target”. Other targets may yield models that appear to generalize poorly to “target” but end up helping in an ensemble.

Can someone explain the rationale behind this approach? Why would using a “wrong” target for training help with the real target? Is this technique specific to finance data (more noise than signal) or is this a general idea in ML?

restrading · September 14, 2021, 1:47am

I think this is more related to the concept of generalization in ML, where an ensemble of weakly correlated models (in this case a result of training on different targets) reduces variance on out-of-sample data at a slight cost in bias.
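To put a number on the variance part, here is a small sketch under a textbook simplification: n models whose errors all have the same variance sigma2 and the same pairwise correlation rho. The variance of their equal-weight average is then sigma2 * (rho + (1 - rho) / n). The function name is made up, and rho here is the correlation between model errors, which is not the same thing as the ~0.3 to ~0.9 correlations between targets quoted above.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of the equal-weight average of n_models predictors.

    Assumes every predictor has error variance sigma2 and every pair of
    predictors has the same error correlation rho (a simplification).
    """
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    return sigma2 * (rho + (1.0 - rho) / n_models)


if __name__ == "__main__":
    # Compare highly correlated and weakly correlated ensembles of 5 models.
    for rho in (0.9, 0.3):
        print(rho, ensemble_variance(1.0, rho, 5))
```

With five models, a correlation of 0.9 leaves roughly 92% of a single model's variance, while 0.3 leaves roughly 44%, which is why less-correlated members are worth having even if each one looks a bit worse on its own.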

lothlorien · September 14, 2021, 7:44am

So how do I download the supermassive dataset via GraphiQL? It seems I should pass the round number for the NEW dataset, like this?

{dataset(tournament:8,round:281)}

but the resulting data is the old set.

lothlorien · September 15, 2021, 9:46am

I figured it out:
query { listDatasets }
will give you a list of file names available.
To get the download link for the legacy data, you do NOT specify a filename:
query { dataset ( tournament:8, round:281 ) }
To get the new, supermassive data as a .zip file containing .parquet files:
query { dataset ( tournament:8, round:281, filename:"numerai_datasets.zip" ) }
To get the new, supermassive data as .csv or the int8 versions, you need to specify which file you want, e.g.:
query { dataset ( tournament:8, round:281, filename:"numerai_training_data.csv" ) }
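If you would rather send these queries from a script than from GraphiQL, something along these lines should work. This is a sketch using only the standard library. The endpoint URL and the shape of the response (`data.dataset` holding the download link) are assumptions on my part, as are the function names; some queries may also need an auth header.

```python
import json
import urllib.request

# Assumed endpoint for the tournament GraphQL API; adjust if yours differs.
API_URL = "https://api-tournament.numer.ai"


def build_dataset_query(round_num, filename=None, tournament=8):
    """Build the dataset query; omitting filename asks for the legacy data."""
    args = [f"tournament: {tournament}", f"round: {round_num}"]
    if filename is not None:
        # json.dumps gives a properly quoted and escaped string literal.
        args.append(f"filename: {json.dumps(filename)}")
    return "query { dataset(" + ", ".join(args) + ") }"


def fetch_download_url(query, api_url=API_URL):
    """POST the query and return the link found under data.dataset (assumed)."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["data"]["dataset"]
```

For example, `fetch_download_url(build_dataset_query(281, "numerai_datasets.zip"))` would request the link for the zipped parquet files, and leaving out the filename would request the legacy data.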

johnnywhippet · September 16, 2021, 8:31pm

yeah but i find csv files easier to manipulate than parquet files.

testnet666 · September 19, 2021, 10:55pm

Getting a 403 Client Error: Forbidden when trying to download the ‘old_data_new_val.parquet’ file in either parquet or CSV versions. Anyone have a solution for this?

thekizoch · September 20, 2021, 6:53pm

Is there a timeline estimation for when the legacy dataset will be deprecated?

taori · September 21, 2021, 6:09pm

Is there a timeline estimation for when the legacy dataset will be deprecated?

Same here, I would like to know how much time I have to migrate my code to the new dataset.

maxchu · September 22, 2021, 5:15am

I found out that some targets have “nan” values; for example, “target_arthur_60” has 20599 nans in “numerai_training_data.parquet”. Is that normal?

gammarat · September 22, 2021, 6:02am

Yes. I don’t remember the exact number offhand though, and I don’t think there were any in the primary target.
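For checking this yourself, here is a minimal pandas sketch that counts NaNs per target column and drops the affected rows for whichever target you train on. It assumes target columns start with `target`; the function names are just for illustration.

```python
import numpy as np
import pandas as pd


def count_target_nans(df, prefix="target"):
    """Return NaN counts for target columns that have at least one NaN."""
    target_cols = [c for c in df.columns if c.startswith(prefix)]
    counts = df[target_cols].isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


def rows_with_target(df, target_col):
    """Keep only the rows where the chosen target is present."""
    return df.dropna(subset=[target_col])
```

Something like `count_target_nans(pd.read_parquet("numerai_training_data.parquet"))` would list the affected targets, assuming the file fits in memory.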
