Super Massive Data Release: Deep Dive

It seems that this may be accessed via GraphiQL using { dataset(tournament: 8, round: 280, filename: "old_data_new_val.parquet") }

I’ll make a new thread about it, unless you already have an answer?

No firm timeline – many months at least. (Maybe even never.)

Thanks for uploading the validation data for the old model. However, I got these errors when I ran the ‘download_data’ function:

TypeError: download_data() got an unexpected keyword argument 'round'

When I deleted the "round=280" argument, I got the following error instead:

HTTPError: 403 Client Error: Forbidden for url: https://numerai-datasets.s3.amazonaws.com/284/v3/old_data_new_val.parquet?
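
For anyone hitting the same TypeError, here is a minimal sketch of checking up front whether a given download_data accepts a round keyword. It uses only the standard-library inspect module. The two stand-in functions are hypothetical and only imitate an older and a newer helper signature; they are not the actual notebook code.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with `name=` as a keyword argument."""
    params = inspect.signature(func).parameters.values()
    return any(
        # A **kwargs parameter swallows any keyword
        p.kind is inspect.Parameter.VAR_KEYWORD
        # Otherwise the name must match a parameter that is not positional-only
        or (p.name == name and p.kind is not inspect.Parameter.POSITIONAL_ONLY)
        for p in params
    )


# Hypothetical stand-ins for an older and a newer helper signature
def old_style_download_data(napi, filename, dest_path):
    pass


def new_style_download_data(napi, filename, dest_path, round=None):
    pass


print(accepts_keyword(old_style_download_data, "round"))
print(accepts_keyword(new_style_download_data, "round"))
```

If the check fails for the helper you have locally, your copy of the helper is probably older than the notebook that calls it.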

As pyr395410 pointed out, you can get the old-features data over the new validation eras like this:

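import pandas as pd
from numerapi import NumerAPI

napi = NumerAPI()
# Ask the API for the download URL of the old-features file for round 280
query = """
query {
  dataset(tournament: 8, filename: "old_data_new_val.parquet", round: 280)
}
"""
old_data_new_val_df = pd.read_parquet(napi.raw_query(query)['data']['dataset'])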
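
If you need the same query for other files or rounds, a small helper can build the string for you. This is only a sketch: build_dataset_query is a hypothetical name, and the query shape is copied from the post above rather than from the API schema.

```python
import json


def build_dataset_query(filename, round_num, tournament=8):
    """Build the GraphQL `dataset` query string used in the post above."""
    # json.dumps produces a double-quoted, escaped string literal
    return (
        "query {\n"
        f"  dataset(tournament: {int(tournament)}, "
        f"filename: {json.dumps(filename)}, round: {int(round_num)})\n"
        "}"
    )


query = build_dataset_query("old_data_new_val.parquet", 280)
print(query)
```

The resulting string can be passed to napi.raw_query(query) in the same way as the hand-written query.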

Trying to upload 285 and getting:
Runtime.ImportModuleError Unable to import module 'tournament_validate': No module named 'pydantic'

Out of curiosity, why isn’t the tournament data released automatically for training once the tournament ends (so every week we should have new training data)?

In my eyes, the way the new and old datasets coexist and are retrieved is very sloppy. I have not seen a clear example of the dataset being loaded, and I have now been unable to participate for 3 weeks because of that. The download_data function from the example notebook does not work. Terrible… reverting back to:
napi.download_current_dataset(dest_path="…/286/", dest_filename=None,
unzip=True, tournament=8)
The above code only gives the CSV.
How would I download the parquet files with the new data?

from numerapi import NumerAPI
# Create the API client
napi = NumerAPI()
# (remote filename, local destination path)
datasets = [('numerai_training_data_int8.parquet', 'training_data.parquet'),
            ('numerai_tournament_data_int8.parquet', 'tournament_data.parquet'),
            ('numerai_validation_data_int8.parquet', 'validation_data.parquet'),
            ('numerai_live_data_int8.parquet', 'live_data.parquet'),
            ('example_validation_predictions.parquet', 'example_val_pred.parquet'),
           ]
# Download datasets
for dataset in datasets:
    napi.download_dataset(*dataset)

Not sure how you get that to work, it throws:
AttributeError: 'NumerAPI' object has no attribute 'download_dataset'

Did you update to the most recent version? The command was first added in version 2.8.0 according to the changelog:

https://numerapi.readthedocs.io/en/stable/changelog.html

I don’t know how. I just run pip install numerapi each time, but that does not seem to update it.
OK, I managed to uninstall and reinstall it, thanks.

pip install --upgrade numerapi
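
To confirm that the upgrade actually took effect, you can check the installed version against 2.8.0, the release that the changelog linked above names as the first with download_dataset. This is a rough sketch using only the standard library; the helper names are made up for illustration, and the version parsing is deliberately simple (it only looks at the leading digits of the first three components).

```python
import re
from importlib import metadata

MIN_VERSION = (2, 8, 0)  # first release with download_dataset, per the changelog


def version_tuple(version):
    """Turn a string such as '2.9.4' into a comparable tuple (2, 9, 4)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_download_dataset(version):
    """True if this numerapi version is at least MIN_VERSION."""
    return version_tuple(version) >= MIN_VERSION


def installed_numerapi_version():
    """Return the installed numerapi version string, or None if it is not installed."""
    try:
        return metadata.version("numerapi")
    except metadata.PackageNotFoundError:
        return None


installed = installed_numerapi_version()
if installed is None:
    print("numerapi is not installed in this environment")
elif has_download_dataset(installed):
    print(f"numerapi {installed} should provide download_dataset")
else:
    print(f"numerapi {installed} is too old; run: pip install --upgrade numerapi")
```

If this still reports an old version after upgrading, pip and your notebook are likely using different Python environments.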

Quick question on the data: as of right now the training data includes eras 1 to 574 and the validation data includes eras 857 to 961, while the unlabeled tournament data includes the missing 300 or so eras between 574 and 857. If I understand it correctly, we are in essence missing roughly 6 recent years of data. If so, what is the reasoning behind this? It seems that including that data in the training set could yield much better performance.
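
As a quick sanity check on the size of that gap, here is the arithmetic spelled out. It assumes one era per week (52 per year), which is an assumption on my part, and takes the era numbers from the post above.

```python
last_training_era = 574
first_validation_era = 857
ERAS_PER_YEAR = 52  # assumption: one era per week

# Eras strictly between the end of training and the start of validation
missing_eras = first_validation_era - last_training_era - 1
missing_years = missing_eras / ERAS_PER_YEAR

print(f"{missing_eras} eras, about {missing_years:.1f} years")
```

Under that assumption the gap comes out at 282 eras, which is closer to five and a half years than six.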

They’ve needed it for backtests and such – testing our models over a significant period on eras we don’t have the targets for. In the past they’ve said this is important for their own planning/optimization and also to show potential investors. However, recently they’ve indicated that they are going to release the targets for the test set as well – I think they said probably in December. (Apparently the test set is no longer needed in this way internally?) Anyway, in another month or so we should have that data too. (Of course, sometimes plans change; we’ll see if it happens.)

Awesome, thanks for the reply. That should give a healthy boost in performance to everyone.

Hi, I have a couple of questions that I have been thinking about for a while and haven’t found an answer to.

Is it possible to know the era number of the live era? That way we could use the temporal information to build temporal features such as the mean of the targets in the previous era for rows where feature_1 is less than 0.5.
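
To make the kind of feature described above concrete, here is a minimal pandas sketch. It can only be computed for eras whose targets are known. The column names era and target are assumptions about the dataframe layout, the function name is made up, and the tiny dataframe is invented purely for illustration.

```python
import pandas as pd


def prev_era_conditional_target_mean(df, feature="feature_1", threshold=0.5,
                                     era_col="era", target_col="target"):
    """For each era, mean target of the previous era's rows with feature < threshold."""
    subset = df[df[feature] < threshold]
    per_era = subset.groupby(era_col)[target_col].mean()
    # Keep every era (NaN where no row passed the filter), in era order
    eras = sorted(df[era_col].unique())
    per_era = per_era.reindex(eras)
    # Shift by one era so each era only sees information from the era before it
    return per_era.shift(1)


# Invented toy data: three eras with two rows each
toy = pd.DataFrame({
    "era": [1, 1, 2, 2, 3, 3],
    "feature_1": [0.25, 0.75, 0.0, 0.25, 1.0, 0.5],
    "target": [0.5, 1.0, 0.0, 0.5, 0.0, 1.0],
})
feature_by_era = prev_era_conditional_target_mean(toy)
print(feature_by_era)
```

The first era has no predecessor, so its value is NaN. The result is indexed by era and can be mapped back onto the rows with df["era"].map(feature_by_era).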

My other question is: how is it possible that the validation data is more recent than the live data? That doesn’t make sense to me, because in live we are predicting the next week.

Thank you in advance, mates :smiley:

The live era is the final era in the tournament dataset each week. It doesn’t even have a number. Under the old system we’ve just left (before this massive data release), each week last week’s live era would simply be added to the test set. However, with the new data, that’s not happening anymore (the test set is remaining static), so if you want last week’s live data you’d now have to save it each week yourself (or get it from somebody who has done that).
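
Saving the live data yourself each week could look roughly like this. It is a sketch that assumes you have already downloaded this week's live file (for example as live_data.parquet, as in the download script earlier in the thread) and that you supply the round number yourself. The function name and the archive folder layout are made up for illustration.

```python
import shutil
from pathlib import Path


def archive_live_file(live_path, round_num, archive_dir="live_archive"):
    """Copy this week's live file to <archive_dir>/live_data_round_<round_num><ext>."""
    src = Path(live_path)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"live_data_round_{round_num}{src.suffix}"
    # Never overwrite a copy that was already saved for this round
    if not dest.exists():
        shutil.copy2(src, dest)
    return dest
```

Calling archive_live_file("live_data.parquet", 286) once per round, right after downloading, builds up a folder with one live file per week.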

As far as the validation data being more recent than the live data, it isn’t, because as you say, that wouldn’t make sense.

Thank you very much! That was super helpful.

One last thing: we can save the live data, but is there a way to also save the targets for the live data at the end of the week?

No – we never get the live targets under the current scheme. This may change at some point though; they’ve been talking about it. (They have already said that we are going to get the existing test set targets soon, though.)
