SQL and the Dataset

I have a question…

How can I find out what records and fields the dataset contains?
Can anyone point me to that information?

Have you tried downloading a copy of the dataset first?

I have, and when I try to open it my laptop locks up; not sure why. I remember being able to open the file before.

The dataset is much bigger now, so you might be maxing out your laptop’s RAM.


It’s best to use the API; that way you can download a smaller version of the data. The best place to start is the GitHub repository: Numerai · GitHub.
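For context, the example scripts download the v4 files with NumerAPI before the reading step below. A rough sketch of that download step (assuming the numerapi package is installed and that the v4 file names below still match the current data release):

# a minimal sketch, assuming numerapi is installed (pip install numerapi)
# and that the v4 file names still match the current data release
from numerapi import NumerAPI

napi = NumerAPI()  # public data can be downloaded without API keys

# fetch the feature metadata plus the train/validation parquet files into a local v4/ folder
for filename in ["v4/features.json", "v4/train.parquet", "v4/validation.parquet"]:
    napi.download_dataset(filename, filename)

# the live file changes every round, so save it under a per-round name
current_round = napi.get_current_round()
napi.download_dataset("v4/live.parquet", f"v4/live_{current_round}.parquet")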

You can use this chunk from example_scripts/example_model.py; change "medium" to "small" for the smallest version of the data.

import json

import pandas as pd

# ERA_COL, DATA_TYPE_COL, TARGET_COL and current_round are defined earlier in
# example_model.py (the column-name constants and the round number from
# napi.get_current_round())

print('Reading minimal training data')
# read the feature metadata and get a feature set (or all the features)
with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
# features = list(feature_metadata["feature_stats"].keys()) # get all the features
# features = feature_metadata["feature_sets"]["small"] # get the small feature set
features = feature_metadata["feature_sets"]["medium"] # get the medium feature set
# read in just those features along with era and target columns
read_columns = features + [ERA_COL, DATA_TYPE_COL, TARGET_COL]

# note: sometimes when trying to read the downloaded data you get an error about invalid magic parquet bytes...
# if so, delete the file and rerun napi.download_dataset to fix the corrupted file
training_data = pd.read_parquet('v4/train.parquet',
                                columns=read_columns)
validation_data = pd.read_parquet('v4/validation.parquet',
                                  columns=read_columns)
live_data = pd.read_parquet(f'v4/live_{current_round}.parquet',
                            columns=read_columns)

The latest dataset’s size makes it almost impossible to do meaningful work in Google Colab with 12 GB of RAM. I am using Kaggle notebooks to get 16 GB of RAM, and so far so good.
I have made my Kaggle download notebook public, with the dataset for the current round. If you are running in your own Kaggle notebook, you can chain notebooks and use the already downloaded data from my notebook’s output.
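If RAM is still tight, one trick that has worked for me (just a sketch, not an official recommendation) is to read only the small feature set and downcast the feature columns after loading; the exact column names and dtypes are assumptions, so check features.json and the parquet schema on your machine:

# a rough sketch of trimming memory on a 12-16 GB machine; the "small" feature
# set and the column names below are assumptions - check features.json / the parquet schema
import json
import pandas as pd

with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["small"]  # smallest feature set

df = pd.read_parquet("v4/train.parquet", columns=features + ["era", "target"])

# downcast the feature columns; the features only take a few discrete levels,
# so a smaller float type loses nothing in practice
df[features] = df[features].astype("float32")
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")

I believe there are also int8 versions of the parquet files that skip the cast entirely, but check what napi.list_datasets() returns to confirm the exact file names.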