Which is the current dataset?

I think there might still be some misunderstanding. Both files, v4/train.parquet and v4/validation.parquet, have the same feature set; that is, they contain the same columns, all named “feature_XXX…”. If yours don't, then most probably something in your downloading/preprocessing pipeline is broken.
Because they share the same feature set, train.parquet and validation.parquet can be concatenated to form the entire Numerai dataset.
The combined dataset is a time series that, at the time of writing, contains 1035 eras, each one week apart. Since ~1000 samples is a rather low count for time series data (no matter how much information each sample carries), every additional era is very valuable, so I wouldn’t call any of the data “useless”.
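
If you want to check this on your side, here is a minimal sketch with pandas. It compares the “feature_” columns of the two splits and concatenates them only if they match. The helper names `feature_columns` and `combine_splits` are made up for illustration, the `era` column name is an assumption, and the tiny DataFrames are toy stand-ins for what you would normally load with `pd.read_parquet` from your local copies of the v4 files.

```python
import pandas as pd


def feature_columns(df):
    """Return the sorted names of all columns starting with 'feature_'."""
    return sorted(c for c in df.columns if c.startswith("feature_"))


def combine_splits(train, validation):
    """Concatenate two splits after checking that their feature sets match."""
    train_feats = set(feature_columns(train))
    val_feats = set(feature_columns(validation))
    if train_feats != val_feats:
        # A mismatch usually means the files come from different versions.
        mismatch = sorted(train_feats ^ val_feats)
        raise ValueError(f"feature sets differ, e.g. {mismatch[:5]}")
    return pd.concat([train, validation], axis=0)


# Toy stand-ins; with real data you would use something like
# pd.read_parquet("v4/train.parquet") and pd.read_parquet("v4/validation.parquet").
train = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002"],  # assumed era column name
        "feature_a": [0.0, 0.5, 1.0],
        "feature_b": [0.25, 0.75, 0.5],
    }
)
validation = pd.DataFrame(
    {
        "era": ["0003", "0004"],
        "feature_a": [0.5, 0.0],
        "feature_b": [1.0, 0.25],
    }
)

full = combine_splits(train, validation)
n_eras = full["era"].nunique()
print(f"{len(full)} rows across {n_eras} eras")
```

If the check raises a `ValueError`, the two files most likely belong to different dataset versions, so the download step is the first place to look.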

Yes, you should be seeing the same features for all v4 files, and yes the validation set is just a continuation of the training set with an arbitrary break point.

Thank you for that reassurance, @kayeffnumeraitor and @wigglemuse. It turns out I was indeed still picking up v3 of the train data by mistake.