Super Massive Data Release: Deep Dive

pyr395410 · September 25, 2021, 8:38am

Seems that this may be accessed via GraphiQL using { dataset(tournament: 8, round: 280, filename: "old_data_new_val.parquet") }

thekizoch · October 1, 2021, 3:28pm

I’ll make a new thread about it, unless you already have an answer?

wigglemuse · October 1, 2021, 5:53pm

No firm timeline – many months at least. (Maybe even never.)

eses · October 7, 2021, 4:35pm

Thanks for uploading the validation data for the old model. However, I got these errors when I ran the ‘download_data’ function:

TypeError: download_data() got an unexpected keyword argument 'round'

When I deleted the "round=280" input, I got the following error:

HTTPError: 403 Client Error: Forbidden for url: https://numerai-datasets.s3.amazonaws.com/284/v3/old_data_new_val.parquet?

bundushathur · October 8, 2021, 11:05am

As pyr395410 pointed out you can get the old features data over the new validation eras like this:

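import pandas as pd
from numerapi import NumerAPI

napi = NumerAPI()
# The dataset field returns a download URL for the requested file
query = """
query {
  dataset(tournament: 8, filename: "old_data_new_val.parquet", round: 280)
}
"""
old_data_new_val_df = pd.read_parquet(napi.raw_query(query)['data']['dataset'])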

bob_watson · October 9, 2021, 6:27pm

Trying to upload 285 and getting:
Runtime.ImportModuleError Unable to import module 'tournament_validate': No module named 'pydantic'

taori · October 15, 2021, 2:01pm

Out of curiosity, why isn’t the tournament data released automatically for training once the tournament ends (so every week we should have new training data)?

vurehout66 · October 17, 2021, 11:12am

In my eyes, the way the new and the old datasets coexist and get retrieved is very sloppy. I have not seen a clear example of the dataset being loaded, and because of that I have not been able to participate for 3 weeks. The download_data function in the example notebook does not work. Terrible… reverting back to:
napi.download_current_dataset(dest_path="…/286/", dest_filename=None,
unzip=True, tournament=8)
The above code only gives the CSV.
How would I download the parquet with the new data?

oliveoil · October 17, 2021, 11:29am

from numerapi import NumerAPI
# NumerAPI
napi = NumerAPI()
# (file, filename)
datasets = [('numerai_training_data_int8.parquet', 'training_data.parquet'),
            ('numerai_tournament_data_int8.parquet', 'tournament_data.parquet'),
            ('numerai_validation_data_int8.parquet', 'validation_data.parquet'),
            ('numerai_live_data_int8.parquet', 'live_data.parquet'),
            ('example_validation_predictions.parquet', 'example_val_pred.parquet'),
           ]
# Download datasets
for dataset in datasets:
    napi.download_dataset(*dataset)

vurehout66 · October 17, 2021, 11:31am

Not sure how you get that to work, it throws:
AttributeError: 'NumerAPI' object has no attribute 'download_dataset'

oliveoil · October 17, 2021, 11:33am

Did you update to the most recent version? The command was first added in version 2.8.0 according to the changelog:

https://numerapi.readthedocs.io/en/stable/changelog.html

vurehout66 · October 17, 2021, 11:40am

I don’t know how. I just run pip install numerapi each time, but that does not seem to update it.
OK, I managed to uninstall and reinstall, thanks.

jeremy_berros · October 17, 2021, 7:54pm

pip install --upgrade numerapi
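
For anyone hitting the same AttributeError, here is a minimal sketch for checking whether the installed numerapi is new enough before calling download_dataset. The 2.8.0 threshold comes from the changelog linked above; the helper names (parse_version, has_download_dataset, installed_numerapi_version) are made up for this example and are not part of numerapi.

```python
from importlib.metadata import PackageNotFoundError, version

# Release that first shipped download_dataset, per the changelog cited above.
MIN_VERSION = (2, 8, 0)


def parse_version(text):
    """Turn '2.9.1' into (2, 9, 1), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_download_dataset(installed):
    """True if this version string is at least MIN_VERSION."""
    return parse_version(installed) >= MIN_VERSION


def installed_numerapi_version():
    """Return the installed numerapi version string, or None if it is missing."""
    try:
        return version("numerapi")
    except PackageNotFoundError:
        return None


found = installed_numerapi_version()
if found is None:
    print("numerapi is not installed")
elif has_download_dataset(found):
    print(f"numerapi {found} should provide download_dataset")
else:
    print(f"numerapi {found} is too old, run: pip install --upgrade numerapi")
```

Plain pip install numerapi leaves an already installed copy untouched, which is why the --upgrade flag (or an uninstall and reinstall) is needed.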

luee · November 8, 2021, 4:58pm

Quick question on the data: as of right now the training data includes eras 1 to 574 and the validation data includes eras 857 to 961, while the unlabeled tournament data includes the missing 300 or so eras between 574 and 857. If I understand it correctly, we are in essence missing roughly 6 recent years of data. If so, what is the reasoning behind this? It seems that including that data in the training set could yield much better performance.
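
As a rough check of that estimate, the size of the gap can be worked out from the era numbers quoted above. This sketch assumes one era per week, which is an assumption and not something stated in this thread.

```python
TRAIN_LAST_ERA = 574         # last training era quoted above
VALIDATION_FIRST_ERA = 857   # first validation era quoted above
ERAS_PER_YEAR = 52           # assumption: one era per week

# Eras strictly between the end of training and the start of validation
missing_eras = VALIDATION_FIRST_ERA - TRAIN_LAST_ERA - 1
missing_years = missing_eras / ERAS_PER_YEAR

print(missing_eras, round(missing_years, 1))  # prints: 282 5.4
```

Under that assumption the gap is 282 eras, or about 5.4 years, a bit under the "roughly 6 years" figure.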

wigglemuse · November 8, 2021, 5:20pm

They’ve needed it for backtests and such – testing our models on eras we don’t have the targets for over a significant period. In the past they’ve said this is important for their own planning/optimization and also to show potential investors. However, recently they’ve indicated that they are going to release the targets for the test set also – I think they said probably in December. (Apparently the test set is no longer needed in this way internally?) Anyway, in another month or so we should have that data too. (Of course sometimes plans change, we’ll see if it happens.)

luee · November 8, 2021, 5:24pm

Awesome thanks for the reply, that should give a healthy boost in performance to everyone

jaca_ml · November 14, 2021, 12:07pm

Hi, I have a couple of questions that I have been thinking about for a while and have not found answers to.

Is it possible to know which era number the live era has? That way we could use the temporal information to build temporal features such as the mean of the targets when feature_1 is less than 0.5 in the last era.
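
To make the feature idea concrete, here is a minimal pandas sketch of a "previous era" conditional target mean. It only works on eras whose targets are known, and the column names era, target and feature_1 as well as the function name are assumptions for illustration.

```python
import pandas as pd


def lagged_conditional_target_mean(df, feature="feature_1", threshold=0.5):
    """For each era, return the mean target of the PREVIOUS era's rows
    whose `feature` value was below `threshold`."""
    below = df[df[feature] < threshold]
    per_era = below.groupby("era")["target"].mean()
    # Keep every era in order, even ones with no rows under the threshold
    per_era = per_era.reindex(sorted(df["era"].unique()))
    # Shift by one era so each era only sees information from the one before it
    return per_era.shift(1)


toy = pd.DataFrame({
    "era":       [1, 1, 2, 2, 3, 3],
    "feature_1": [0.25, 0.75, 0.0, 0.25, 0.25, 1.0],
    "target":    [0.5, 1.0, 0.25, 0.25, 0.75, 0.0],
})
result = lagged_conditional_target_mean(toy)
print(result)
```

The first era has no predecessor, so its value is NaN. The same would hold for any era that follows one without known targets.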

Another question that I have is: How is it possible that the validation data is more recent than the live data? It doesn’t make sense to me because we are predicting the next week in live

Thanks in advance, mates.

wigglemuse · November 14, 2021, 5:50pm

The live era is the final era in the tournament dataset each week. It doesn’t even have a number. Under the old system we’ve just left (before this massive data release), each week last week’s live era would simply be added to the test set. However, with the new data, that’s not happening anymore (the test set is remaining static), so if you want last week’s live data you’d now have to save it each week yourself (or get it from somebody who has done that).
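
Saving the live data each week could look roughly like the sketch below. It assumes a client that behaves like numerapi's NumerAPI (get_current_round and download_dataset), reuses the live file name from oliveoil's post above, and uses a made-up archive folder and file naming scheme.

```python
from pathlib import Path


def live_archive_path(dest_dir, round_number):
    """Build a per-round file name so each week's download is kept separately."""
    return Path(dest_dir) / f"live_data_round_{round_number}.parquet"


def archive_live_data(napi, dest_dir="live_archive"):
    """Download this week's live era unless it is already archived.

    `napi` is expected to behave like numerapi.NumerAPI, i.e. to provide
    get_current_round() and download_dataset(filename, dest_path).
    """
    round_number = napi.get_current_round()
    target = live_archive_path(dest_dir, round_number)
    if target.exists():
        return target  # this round was already saved
    target.parent.mkdir(parents=True, exist_ok=True)
    napi.download_dataset("numerai_live_data_int8.parquet", str(target))
    return target


# Usage (needs network access and numerapi >= 2.8.0):
#   from numerapi import NumerAPI
#   archive_live_data(NumerAPI())
```

Run once per round (for example from a weekly scheduled job), this leaves one parquet file per round in the archive folder.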

As far as the validation data being more recent than the live data, it isn’t, because as you say, that wouldn’t make sense.

jaca_ml · November 14, 2021, 9:22pm

Thank you very much! That was super helpful.

One last thing: we can save the live data, but is there a way to also save the targets for the live data at the end of the week?

wigglemuse · November 14, 2021, 9:29pm

No – we never get the live targets under the current scheme. This may change at some point though, they’ve been talking about it. (They have said already we are going to get the existing test set targets soon though.)
