Which dataset should I use?

ryo_matsuzaka · September 18, 2022, 11:53am

Hello. I am new to Numerai; I started today.
I have a question about the datasets.

I found three ways to download the dataset:

1. Using the API, like this:
napi.download_dataset("v4/train.parquet")

ref: https://github.com/numerai/example-scripts/blob/master/example_model.py

2. From the S3 bucket, like this:
training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")

3. From the Numerai dashboard

I tried each of these datasets to create prediction.csv, but the diagnostics tool only ran successfully on the predictions made with the 2nd option. For the others, I got this message:

Your upload seems to be invalid:

high_invalid_ticker_count: Looks like your upload had 0% of the correct IDs. Make sure you’re predicting on the newest Validation data for round 334.

Which dataset should I use?

kayeffnumeraitor · September 18, 2022, 7:02pm

In the end, it is your own decision which data to use. If you want the latest dataset, use v4 (your first option). If you want to try things out and get started quickly, you can use the legacy “v2” dataset, as it is less memory hungry. You could also start with the example scripts. I haven’t tried to run them, but they are at least a source of basic ideas, and probably also show which dataset to use.
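
The high_invalid_ticker_count message in the question says that the ids in the upload did not match the ids expected for the round. One rough way to catch this before uploading is to compare the two sets of ids locally. The sketch below is only an illustration: it assumes both files are CSVs with an `id` column, the helper names `read_ids` and `id_match_fraction` are made up here and are not part of numerapi, and the ids are invented stand-ins for real files.

```python
import csv
import io


def read_ids(csv_file, id_column="id"):
    """Collect the values of id_column from an open CSV file object."""
    return {row[id_column] for row in csv.DictReader(csv_file)}


def id_match_fraction(prediction_ids, expected_ids):
    """Return the fraction of prediction ids that also appear in expected_ids."""
    if not prediction_ids:
        return 0.0
    return len(prediction_ids & expected_ids) / len(prediction_ids)


# Tiny in-memory stand-ins for prediction.csv and the round's data (made-up ids).
predictions_csv = io.StringIO("id,prediction\na1,0.4\na2,0.6\nzz,0.5\n")
expected_csv = io.StringIO("id\na1\na2\na3\n")

fraction = id_match_fraction(read_ids(predictions_csv), read_ids(expected_csv))
print(f"{fraction:.0%} of prediction ids match")
```

A result near 0% would suggest the predictions were made on a different dataset version than the one the upload is checked against, which is consistent with the error above. With real files, `open("prediction.csv", newline="")` would take the place of the `io.StringIO` objects.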
