SQL and the Dataset

I have a question…

How can I find out what records and fields the dataset contains?
Can anyone point me to that information?

Have you tried downloading a copy of the dataset first?

I have, and when I try to open it my laptop locks up; not sure why. I remember being able to open the file before.

The dataset is much bigger now, so you might be maxing out your laptop’s RAM.


It’s best to use the API; that way you can download a smaller version of the data. The best place to start is the GitHub repository: Numerai · GitHub.
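For context, the example scripts download the v4 files with NumerAPI before the reading step below. A rough sketch of that download step (assuming the numerapi package is installed and that the v4 file names below still match the current data release):

# a minimal sketch, assuming numerapi is installed (pip install numerapi)
# and that the v4 file names still match the current data release
from numerapi import NumerAPI

napi = NumerAPI()  # public data can be downloaded without API keys

# fetch the feature metadata plus the train/validation parquet files into a local v4/ folder
for filename in ["v4/features.json", "v4/train.parquet", "v4/validation.parquet"]:
    napi.download_dataset(filename, filename)

# the live file changes every round, so save it under a per-round name
current_round = napi.get_current_round()
napi.download_dataset("v4/live.parquet", f"v4/live_{current_round}.parquet")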

You can use this chunk from example_scripts/example_model.py; change "medium" to "small" for the smallest version of the data.

import json

import pandas as pd

# ERA_COL, DATA_TYPE_COL, TARGET_COL and current_round are defined earlier in
# example_model.py (the column-name constants and the round number from
# napi.get_current_round())

print('Reading minimal training data')
# read the feature metadata and get a feature set (or all the features)
with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
# features = list(feature_metadata["feature_stats"].keys()) # get all the features
# features = feature_metadata["feature_sets"]["small"] # get the small feature set
features = feature_metadata["feature_sets"]["medium"] # get the medium feature set
# read in just those features along with era and target columns
read_columns = features + [ERA_COL, DATA_TYPE_COL, TARGET_COL]

# note: sometimes when trying to read the downloaded data you get an error about invalid magic parquet bytes...
# if so, delete the file and rerun napi.download_dataset to fix the corrupted file
training_data = pd.read_parquet('v4/train.parquet',
                                columns=read_columns)
validation_data = pd.read_parquet('v4/validation.parquet',
                                  columns=read_columns)
live_data = pd.read_parquet(f'v4/live_{current_round}.parquet',
                            columns=read_columns)

The latest dataset’s size makes it almost impossible to do meaningful work in Google Colab with 12 GB of RAM. I am using Kaggle notebooks to get 16 GB of RAM, and so far so good.
I have made my Kaggle download notebook public, with the dataset for the current round. If you are running in your own Kaggle notebook, you can chain notebooks and use the already downloaded data from my notebook’s output.
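If RAM is still tight, one trick that has worked for me (just a sketch, not an official recommendation) is to read only the small feature set and downcast the feature columns after loading; the exact column names and dtypes are assumptions, so check features.json and the parquet schema on your machine:

# a rough sketch of trimming memory on a 12-16 GB machine; the "small" feature
# set and the column names below are assumptions - check features.json / the parquet schema
import json
import pandas as pd

with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]["small"]  # smallest feature set

df = pd.read_parquet("v4/train.parquet", columns=features + ["era", "target"])

# downcast the feature columns; the features only take a few discrete levels,
# so a smaller float type loses nothing in practice
df[features] = df[features].astype("float32")
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")

I believe there are also int8 versions of the parquet files that skip the cast entirely, but check what napi.list_datasets() returns to confirm the exact file names.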