Super Massive Data Release: Deep Dive

If you were using Numerapi to download the dataset previously, it will continue to download the old dataset. With Numerapi you need to actively switch to using the new data. This was done to not break people’s compute pipelines.
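For anyone ready to switch, here is a minimal sketch of opting in to the new files. The `download_dataset` method and these file names are taken from the release; whether your installed numerapi has the method depends on its version, so this is a sketch, not a guaranteed API:

```python
NEW_FILES = [
    "numerai_training_data.parquet",
    "numerai_validation_data.parquet",
    "numerai_tournament_data.parquet",
]

def fetch_new_data(napi, dest_dir="."):
    """Explicitly request the new-format files; the legacy zip endpoint
    used by download_current_dataset() is left untouched."""
    for name in NEW_FILES:
        napi.download_dataset(name, f"{dest_dir}/{name}")

# Usage (hits the network):
#   from numerapi import NumerAPI
#   fetch_new_data(NumerAPI())
```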


Thanks for the clarification. It might solve my problem in the short term, since I am using Colab to download the data and submit some predictions. But my other route will be impacted: I also download the data, unzip it, and upload it to Azure Machine Learning (studio) for prediction. The total space Azure offers in the cloud is 10 GB :frowning:

Looking at the number of eras, do I understand correctly that the old train and validation 1 (eras 121-132) become the current train period (but now with weekly data instead of monthly), the old validation 2 became the current validation (but again upsampled), and another year of data was added to the dataset?

Also, can we find out which dates the eras correspond to?


I have some problems running the new scripts. I might be missing something, but I cannot find the download_dataset method in numerapi. There are a download_current_dataset and a download_latest_data, but both are missing the round argument. What is the difference between those two? Many thanks in advance.


@foolish_observer It was added recently. Maybe you need to upgrade to the latest version. It looks like the functions you named will be deprecated in the future.
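After upgrading (`pip install --upgrade numerapi`), the newer call looks roughly like this. The `round_num` keyword name is an assumption based on recent numerapi versions, so check the signature your installed build actually exposes:

```python
def download_training_data(napi, round_num=None):
    """Replacement for the deprecated download_current_dataset() /
    download_latest_data(), with optional pinning to a specific round."""
    napi.download_dataset(
        "numerai_training_data.parquet",   # file to fetch from the API
        "numerai_training_data.parquet",   # local destination path
        round_num=round_num,               # keyword name assumed; verify locally
    )
```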


I’m still pretty new to coding, but I would like to try dividing the features into blocks of 210. How could you do that? Is there a specific pandas function that can be used?

check out this cheatsheet:
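As a concrete sketch with toy data: `numpy.array_split` carves the column list into five blocks of 210. The 1050 numbered feature names below are placeholders; in the real dataset you would take the names from `df.columns`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data: 1050 feature columns.
df = pd.DataFrame(
    np.random.randint(0, 5, size=(8, 1050)),
    columns=[f"feature_{i}" for i in range(1050)],
)

feature_cols = [c for c in df.columns if c.startswith("feature_")]
blocks = np.array_split(feature_cols, 5)  # 5 blocks of 210 names each

# Work on one block at a time, e.g. fit a model per block:
first_block = df[list(blocks[0])]
print(len(blocks), first_block.shape)  # 5 (8, 210)
```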


Would it be possible to provide the training data in CSV format? I’m stuck with 16 GB of RAM for now, and it is simply impossible to work with such a large parquet file. I would like to play with the new data, and as CSV I could split the training dataset into multiple chunks to fit it into memory…

Or am I just dumb, and there is some simple way to split/partition the file even while it stays in parquet format?

If the parquet file is too big, then the CSV will be way, way too big. [edit] Oops, I forgot that those formats only differ on disk, not in RAM. They’ll be the same size in RAM (in numpy/pandas).

Soon you’ll be able to download an int8 version of the training data (features are 0, 1, 2, 3, & 4). Pandas + 16 GB can read that parquet file. For further work, you could compress it by:
for i in range(210): cooked_feature[i] = sum(raw_feature[i::210])
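A runnable version of that compression idea with NumPy (the 210-way grouping is this post’s suggestion, not an official transform):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(0, 5, size=1050).astype(np.int8)  # one row of int8 features

# cooked[i] = raw[i] + raw[i + 210] + raw[i + 420] + ...  (1050 -> 210 values)
cooked = raw.reshape(-1, 210).sum(axis=0)

print(cooked.shape)  # (210,)
```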

I have Colab Pro with 35.25 GB of RAM, and both example scripts crash due to lack of RAM! How much RAM is needed to run the example code? Or do you have any tips / tools to reduce RAM consumption?
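Until the int8 files land, one stop-gap is downcasting yourself: the float features take only the values 0, 0.25, 0.5, 0.75, 1, so multiplying by 4 maps them losslessly to int8 and shrinks the frame to roughly a quarter of its size. A sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the float32 training frame.
df = pd.DataFrame(
    np.random.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(1000, 50)),
    columns=[f"feature_{i}" for i in range(50)],
).astype(np.float32)

before = df.memory_usage(deep=True).sum()
df = (df * 4).astype(np.int8)       # 4 bytes/cell -> 1 byte/cell
after = df.memory_usage(deep=True).sum()

print(f"{before / after:.1f}x smaller")
```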

from RocketChat #announcements
Ok I have heard your primary feedback:

  1. How do we compare our old model performance to our new model performance?
  2. Data too big

Addressing both of these:

  1. There’s a new file accessible via api called old_data_new_val.parquet
    Using the utils in the new example scripts, you can run download_data(napi, 'old_data_new_val.parquet', 'old_data_new_val.parquet', round=280). This will give you the old data, but over the exact same period as the new validation. You will then be able to run your existing models and submit the predictions to diagnostics to get a 1-to-1 comparison against models built on the new data.

  2. I’ve placed new files called numerai_validation_data_int8.parquet, numerai_training_data_int8.csv, etc. These have features as integers 0 to 4, which result in DataFrames about 30% as large.
    I’ve also added numerai_live_data.parquet and numerai_live_data_int8.parquet which only contain the live era each week.

The int8 files will be available for each round so you can make your pipelines expect those if you’re having RAM issues.
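A sketch of a pipeline that expects those per-round int8 files (file names are from the post; `download_dataset` and its `round_num` keyword are assumptions about your numerapi version):

```python
INT8_FILES = [
    "numerai_training_data_int8.parquet",
    "numerai_validation_data_int8.parquet",
    "numerai_live_data_int8.parquet",
]

def round_paths(round_num):
    """Local destination per file, keyed by its API name."""
    return {name: f"round_{round_num}/{name}" for name in INT8_FILES}

def fetch_round(napi, round_num):
    """Pull all int8 files for one round into a round-specific folder."""
    for name, dest in round_paths(round_num).items():
        napi.download_dataset(name, dest, round_num=round_num)
```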



You can use the old API call, like you have in Colab, on your local machine to download the old zip locally, and then you should be set for your Azure upload path. You can also try the int8 version of the new data, as it is quite small.
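For reference, that legacy path in code (the `unzip` flag is part of numerapi’s long-standing `download_current_dataset`; confirm against your installed version):

```python
def fetch_legacy_zip(napi, dest="."):
    # Old-format dataset zip, downloaded and extracted locally; from there
    # you can upload whichever pieces fit Azure ML's 10 GB workspace.
    napi.download_current_dataset(dest_path=dest, unzip=True)
```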


Thank you for releasing the new dataset.
The int8 dataset should resolve some of the memory issues, but I’ve also published a Kaggle Notebook that trains on a 1/4 sample of the eras discussed in this topic, and I’ll share it.
Using DuckDB, I was able to read and train on only specific eras even with the float dataset.
But now, thanks to the int8 dataset, we can load all the training data into memory and still train without DuckDB.


Has anyone run this without errors?


Does anyone else think this change should have been handled by releasing the new data format and giving two or three weeks to adjust and test models before using it in production?
I don’t understand why, after several months of using the old data, we need to go live with the new data in three days…
You still have 8 hours to reconsider this and delay the new data challenge a bit.
Release the old data for this week and the next, and give us time to test and adjust our work.
By giving only a few days you are underestimating the effort people are putting in. Some people can only work on this at weekends. I haven’t had time to check the new data, and surely more people are in the same situation.


@eleven_sigma The team at Numerai have said the legacy-format data will continue; it is not stopping right now, so you can keep using it and move over to the new data format when it is convenient for you.


Yes, but you need to use the API, and I don’t have time this weekend to adjust for it. I think announcing it on a Thursday to go live the same week is absolutely unfair.


I feel Numerai is underestimating the impact on those part-time “data scientists” who use their own computing resources and time to try to meet the weekly commitment.

If the old API still works, why not just continue providing the old dataset as download files? Getting the two processes running in parallel would minimise the impact of the change.


I thought there was some talk in RocketChat about adding a button for submitting with the old data.

Yes. I just checked the chat room, and it looks like legacy data download and submission will still be provided. I suggest the COE conduct a post-mortem review of this event. Providing better-quality data is a good intention from a data science perspective, but the impact of the change was underestimated. The response from the technical team has been fast. I hope we can manage it better next time.
