Which is the current dataset?

kayeffnumeraitor · November 8, 2022, 7:30am

I think there might still be some misunderstanding. Both files, v4/train.parquet and v4/validation.parquet, have the same feature set; that is, they contain the same columns named “feature_XXX…”. If yours do not, then most probably something in your downloading/preprocessing pipeline is broken.
Because they share the same feature set, the files train.parquet and validation.parquet can be concatenated to form the entire Numerai dataset.
This combined dataset is a time series containing, at the time of writing, 1035 eras, where each era is one week apart. Since ~1000 samples is a rather low count for time series data (no matter how much information there is per sample), every additional era is very valuable, so I wouldn’t call any of the data “useless”.
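
As a rough sketch of that check and concatenation with pandas: the helper names `feature_columns` and `combine_train_validation` are made up for illustration, and it only assumes that feature columns start with "feature_" as described above.

```python
import pandas as pd


def feature_columns(df: pd.DataFrame) -> list:
    """Return the column names that start with 'feature_', in frame order."""
    return [c for c in df.columns if str(c).startswith("feature_")]


def combine_train_validation(train: pd.DataFrame, validation: pd.DataFrame) -> pd.DataFrame:
    """Concatenate train and validation after checking their feature sets match.

    Hypothetical helper, not part of any Numerai tooling.
    """
    train_features = set(feature_columns(train))
    validation_features = set(feature_columns(validation))
    if train_features != validation_features:
        # Columns present in one file but not the other.
        mismatch = sorted(train_features ^ validation_features)
        raise ValueError(
            f"feature sets differ (e.g. {mismatch[:5]}); "
            "check the download/preprocessing pipeline"
        )
    # Validation continues where training ends, so train goes first.
    return pd.concat([train, validation], axis=0)
```

With the real files this would be called as `combine_train_validation(pd.read_parquet("v4/train.parquet"), pd.read_parquet("v4/validation.parquet"))`. If the files carry an era column (assumed here to be named `era`), `full["era"].nunique()` on the result should give the era count.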

wigglemuse · November 8, 2022, 12:34pm

Yes, you should be seeing the same features for all v4 files, and yes the validation set is just a continuation of the training set with an arbitrary break point.

liborty · November 9, 2022, 6:21am

Thank you for that reassurance, @kayeffnumeraitor and @wigglemuse. It turns out I was indeed still picking up v3 of the train data by mistake.
