Training, Validation, Test, and Live datasets

  1. Would it be correct to say that there are two perspectives of the data sets?
  2. And if I am interpreting these two perspectives correctly, then what does Numerai use for validation and test data?
  3. Or are validation and test, from Numerai’s point of view, covered by the data scientists’ own validation and test folding strategy?
  4. Is the tournament data encrypted live data?

Numerai’s ensemble model perspective:
Training data = "numerai_training_data.csv"
Validation data = ?
Test data = ?
Live data = "numerai_tournament_data.csv"

Data Scientist’s perspective:
Training data = subset of "numerai_training_data.csv"
Validation data = holdout subset of "numerai_training_data.csv"
Test data = holdout subset of "numerai_training_data.csv"
Live data = "numerai_tournament_data.csv"


The training data file is supposed to be used to build your model. Internally you probably want to hold out part of this set for validation while optimizing your models, or use something like k-fold cross-validation (as long as the folds respect the era structure, i.e. rows from one era are not split between the training and validation sides).
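To make the era-aware folding idea concrete, here is a minimal sketch in plain Python. It assumes you have one era label per row of the training file; the function name `era_kfold` and the choice of contiguous blocks of eras per fold are my own illustration, not anything Numerai prescribes.

```python
from collections import defaultdict


def era_kfold(eras, n_splits=5):
    """Yield (train_idx, val_idx) pairs in which no era is split across sides.

    `eras` is a sequence holding one era label per row (hypothetical input).
    """
    # Collect row indices per era; dicts keep eras in order of first appearance.
    rows_by_era = defaultdict(list)
    for i, era in enumerate(eras):
        rows_by_era[era].append(i)
    unique_eras = list(rows_by_era)
    if n_splits > len(unique_eras):
        raise ValueError("n_splits cannot exceed the number of eras")

    for k in range(n_splits):
        # Each fold validates on one contiguous block of whole eras.
        start = k * len(unique_eras) // n_splits
        stop = (k + 1) * len(unique_eras) // n_splits
        val_eras = set(unique_eras[start:stop])
        train_idx = [i for i, e in enumerate(eras) if e not in val_eras]
        val_idx = [i for i, e in enumerate(eras) if e in val_eras]
        yield train_idx, val_idx
```

You would loop over `era_kfold(era_column, n_splits=5)` and fit on the `train_idx` rows, scoring on the `val_idx` rows. If you already use scikit-learn, grouping the folds by era with its group-based splitters should get you the same effect.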

The validation portion of the tournament data file is what is used to measure the performance of your model for the checks on the website. The ideal situation would be that you don’t use this data at all and only measure your performance on it after you’ve chosen your model to use.

The test portion of the tournament data file is presumably what Numerai uses as their validation data to create their meta model. We don’t have target information on that.

The live portion of the tournament data file is the data that neither we nor Numerai know the targets for yet. It is what they are actually trading on, and what our models’ performance is rated on 3 weeks after the end of the round.
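If you want to look at the three portions separately, something like the sketch below should do it. It assumes the tournament file has a column named `data_type` whose values mark each row as validation, test, or live; check your own copy of the file, since that column name and the sample rows here are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict


def split_by_data_type(csv_file, column="data_type"):
    """Group the rows of an open CSV file by the value found in `column`."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[column]].append(row)
    return dict(groups)


# Tiny in-memory stand-in for the tournament file (made-up rows).
sample = io.StringIO(
    "id,era,data_type,target\n"
    "a,era121,validation,1\n"
    "b,era130,test,\n"
    "c,eraX,live,\n"
)
parts = split_by_data_type(sample)
counts = {name: len(rows) for name, rows in parts.items()}
```

For the real file you would pass `open("numerai_tournament_data.csv", newline="")` instead of the in-memory sample. Only the validation rows carry targets, so those are the only ones you can score locally.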


Thanks. I am embarrassed to admit that I only noticed just now, after you pointed it out, that the new tournament file has three sets of data in it.

But now that I know this, would it not be better if we trained our models on the whole “numerai_training_data.csv” file, and then selected the validation and test sets out of the “numerai_training_data.csv” file, for our local validation and testing respectively?