SQL and the Dataset

I have a question…

How can I find out the records and fields for the dataset?
Can anyone point me to the information?

Have you tried downloading a copy of the dataset first?

I have, and when I try to open it my laptop locks up. I'm not sure why; I seem to remember being able to open the file before.

The dataset is much bigger now so you might be maxing out your laptop’s RAM.


It’s best to use the API, which lets you work with a smaller version of the data. The easiest way to start is the Numerai GitHub repository.

You can use this chunk from example_scripts/example_model.py; change "medium" to "small" to get the smallest version of the data.

import json
import pandas as pd

# column-name constants; assumed to match the ones defined in the example scripts,
# so check the repository for the current values
ERA_COL = "era"
DATA_TYPE_COL = "data_type"
TARGET_COL = "target_nomi_v4_20"

print('Reading minimal training data')
# read the feature metadata and get a feature set (or all the features)
with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
# features = list(feature_metadata["feature_stats"].keys()) # get all the features
# features = feature_metadata["feature_sets"]["small"] # get the small feature set
features = feature_metadata["feature_sets"]["medium"] # get the medium feature set
# read in just those features along with era and target columns
read_columns = features + [ERA_COL, DATA_TYPE_COL, TARGET_COL]

# note: sometimes when trying to read the downloaded data you get an error about invalid magic parquet bytes...
# if so, delete the file and rerun the napi.download_dataset to fix the corrupted file
training_data = pd.read_parquet('v4/train.parquet',
                                columns=read_columns)
validation_data = pd.read_parquet('v4/validation.parquet',
                                  columns=read_columns)
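# current_round is assumed to be set earlier in the script, e.g. current_round = napi.get_current_round()
live_data = pd.read_parquet(f'v4/live_{current_round}.parquet',
                            columns=read_columns)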
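
Before choosing between "small" and "medium", it can help to see which feature sets the metadata file offers and how many fields each one contains. Here is a rough sketch using only the standard library; it assumes features.json has a "feature_sets" key shaped like the one used in the chunk above, and the function name `summarize_feature_sets` is made up for illustration.

```python
import json


def summarize_feature_sets(features_json_path):
    """Return {feature_set_name: number_of_features} for a features.json file.

    Hypothetical helper: assumes the file has a top-level "feature_sets"
    mapping from set name to a list of feature names.
    """
    with open(features_json_path, "r") as f:
        feature_metadata = json.load(f)
    feature_sets = feature_metadata["feature_sets"]
    return {name: len(columns) for name, columns in feature_sets.items()}


# Example usage (assumes the metadata file has been downloaded):
# for name, count in summarize_feature_sets("v4/features.json").items():
#     print(name, count)
```

Picking the set with the fewest features keeps the `columns=` list short, which is what limits how much data pandas reads into memory.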