Back in November (2017) I found out that the training dataset was being repeated, week after week, for quite a while. I haven’t been downloading the datasets since, so I don’t know if this is still the case. If it’s not (i.e., if training datasets now ARE changing, week by week), please just tell me so, and ignore the rest of this post.
From this point on, I’ll assume that training data is NOT changing, at least not EVERY week. The question that comes immediately to mind (I write two months later) is this: doesn’t that make the validation datasets of previous weeks usable as extra training data for this week? I’ll use the name “retention” for this strategy (training on previous weeks’ validation datasets) in the questions that follow.
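To make concrete what I mean by “retention”, here is a minimal sketch. The column names and the data are made up for illustration; the real Numerai datasets have their own feature columns, and this assumes you have been saving each week’s validation file as you download it:

```python
# Hypothetical sketch of the "retention" strategy: fold previous weeks'
# validation sets into this week's training data before fitting a model.
# All data here is synthetic; real datasets would be loaded from disk.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fake_dataset(n_rows):
    """Stand-in for one week's dataset (features plus target)."""
    return pd.DataFrame({
        "feature1": rng.random(n_rows),
        "feature2": rng.random(n_rows),
        "target": rng.integers(0, 2, n_rows),
    })

train = fake_dataset(1000)                                # this week's training data
past_validation = [fake_dataset(200) for _ in range(8)]   # saved from 8 prior weeks

# Retention: append every previously saved validation set to the training set.
augmented_train = pd.concat([train, *past_validation], ignore_index=True)

print(len(augmented_train))  # 1000 + 8*200 = 2600 rows to train on
```

The point is simply that if the training set repeats, the old validation sets are new labeled rows for free, and the effective training set grows every week.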
Question 1: Is retention NOT advantageous to the individual modeler? I’d find that quite counter-intuitive, but Numerai’s homomorphic encryption voodoo may make such things possible. If it indeed is NOT advantageous, I think Numerai would want to tell us so: then we’d all have more time to make good models that contribute to their meta-model.
Question 2: Okay, if retention IS advantageous to the individual modeler… is it ALSO advantageous to Numerai? I.e., does it improve the performance of their meta-model?
If “Yes”: then I think Numerai would want to keep all those previous validation datasets readily available to all of us, to make their meta-model that much better.
If “No” (i.e., retention HELPS individual modelers but HURTS Numerai): then Numerai IS already getting hurt by retention, because the retaining modelers will inevitably come to dominate the contest. So it would then behoove Numerai to start changing the training dataset, or at least to re-encrypt all the datasets every week. Is the encryption too expensive?
I quit the slack channel (its busy intrusiveness annoyed me), so anyone is welcome to copy this post over there.
Thanks for reading all this!