Re-use of the Training Dataset: What (if anything) to do about it?


Back in November (2017) I found out that the training dataset was being repeated, week after week, and had been for quite some time. I haven’t been downloading the datasets since, so I don’t know if this is still the case. If it’s not (i.e., if training datasets now ARE changing, week by week), please just tell me so and ignore the rest of this post.

From this point on, I’ll assume that training data is NOT changing, at least not EVERY week. The question that comes immediately to mind (I write two months later) is this: doesn’t that make the validation datasets of previous weeks usable as extra training datasets for this week? I’ll use the name “retention” for this strategy (training on previous weeks’ validation datasets) in the questions that follow.
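To make the retention idea concrete, here is a minimal sketch of what a retaining modeler might do: pool the unchanged training data with validation data saved from earlier weeks. The column names (`era`, `feature1`, `target`) and the helper function are my own illustrative assumptions, not Numerai’s actual schema or tooling.

```python
import pandas as pd

def build_retained_training_set(train_df, past_validation_dfs):
    """Pool the (unchanging) training data with validation data
    retained from previous weeks -- the "retention" strategy.

    Note: the function name and DataFrame layout are hypothetical,
    for illustration only.
    """
    combined = pd.concat([train_df, *past_validation_dfs], ignore_index=True)
    # If the same validation rows were shipped in more than one week,
    # keep only one copy of each.
    return combined.drop_duplicates()

# Toy illustration with synthetic rows:
train = pd.DataFrame({"era": ["era1", "era1"],
                      "feature1": [0.25, 0.50],
                      "target": [0, 1]})
val_week1 = pd.DataFrame({"era": ["era2"], "feature1": [0.75], "target": [1]})
val_week2 = pd.DataFrame({"era": ["era2"], "feature1": [0.75], "target": [1]})  # same rows re-shipped

full = build_retained_training_set(train, [val_week1, val_week2])
```

If validation data really is identical week to week, the deduplication step means a retaining modeler gains each validation row exactly once, no matter how many weeks they download it.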

Question 1: Is retention NOT advantageous to the individual modeler? I’d find that quite counter-intuitive, but Numerai’s homomorphic encryption voodoo may make such things possible. If it indeed is NOT advantageous, I think Numerai would want to tell us so: then we’d all have more time to make good models that contribute to their meta-model.

Question 2: Okay, if retention IS advantageous to the individual modeler… is it ALSO advantageous to Numerai? I.e., does it improve the performance of their meta-model?

If “Yes”: then I think Numerai would want to keep all those previous validation datasets readily available to all of us, to make their meta-model that much better.

If “No” (i.e., retention HELPS individual modelers, but HURTS Numerai): then Numerai will keep getting hurt, because the retaining modelers will inevitably dominate the contest. So it would behoove Numerai to start changing the training dataset, or at least re-encrypting all the datasets every week. Or is the encryption too expensive?

I quit the slack channel (its busy intrusiveness annoyed me), so anyone is welcome to copy this post over there.

Thanks for reading all this!


Validation and Test data also don’t change. Only the Live data changes week to week.


Thanks for that clarification.

I’m surprised that Numerai doesn’t compel more frequent re-training of our models. Wouldn’t models trained on fresher data yield a better-performing meta-model for them?

I would read with great interest any reply from Numerai personnel.


Yes, I would imagine that fresher and more plentiful data would usually be better, so I’m not really sure why previous competitions’ live data isn’t regularly folded into the dataset. The people at Numerai aren’t stupid, and they know that more data would give us a better chance of finding signal. So I assume they have a good reason for not updating the data more often, but I have no idea what that reason is.