Which dataset should I use?

Hello. I am new to NUMERAI and just started today.
I have a question about the datasets.

I found three ways to download the dataset:

  1. Using the API
    Like this:
    napi.download_dataset("v4/train.parquet")

ref: https://github.com/numerai/example-scripts/blob/master/example_model.py

  2. From the S3 bucket
    Like this:
    training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")

ref: Numerai

  3. From the NUMERAI dashboard


I tried each dataset to create prediction.csv, but the diagnostics tool only ran successfully on the predictions made with the 2nd option. For the others, I got this message:

Your upload seems to be invalid:

high_invalid_ticker_count: Looks like your upload had 0% of the correct IDs. Make sure you’re predicting on the newest Validation data for round 334.

Which dataset should I use?

In the end, which data to use is your own decision. If you want the latest dataset, use the v4 dataset (your first option). If you want to try things out and get started quickly, you can use the legacy “v2” dataset, as it is less memory hungry. You could also start with the example scripts. I haven’t tried running them, but they are at least a source of basic ideas, and they probably also show which dataset to use.
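
Regarding the "0% of the correct IDs" message: one possible cause (an assumption on my part, not something I have verified) is that the ids in prediction.csv come from a different dataset than the one the diagnostics tool expects. A minimal sketch of how you could check this locally with pandas before uploading is below. The helper name id_overlap, the toy data, and the "id" column name are all hypothetical; in some files the ids may be stored as the index instead, in which case call reset_index() first.

```python
import pandas as pd


def id_overlap(predictions: pd.DataFrame, reference: pd.DataFrame) -> float:
    """Return the fraction of prediction ids that also appear in the reference data.

    Assumes both frames carry their row ids in a column named "id"
    (hypothetical; adjust to match your files).
    """
    pred_ids = pd.Index(predictions["id"])
    ref_ids = pd.Index(reference["id"])
    if len(pred_ids) == 0:
        return 0.0
    # isin() gives a boolean array; its mean is the matching fraction.
    return float(pred_ids.isin(ref_ids).mean())


# Toy frames standing in for prediction.csv and the dataset being scored against.
predictions = pd.DataFrame(
    {"id": ["a1", "b2", "c3", "d4"], "prediction": [0.1, 0.5, 0.7, 0.3]}
)
reference = pd.DataFrame({"id": ["a1", "b2", "c3", "z9"]})

overlap = id_overlap(predictions, reference)
print(f"{overlap:.0%} of prediction ids match the reference ids")
```

If the overlap is close to zero, the predictions were most likely built from a different file than the one you are comparing against.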
