How do training data and validation data relate in "time"?

Complete n00b here.

I’ve been playing with the idea of engineering features that tell you something about the era a data point comes from. After some toying, it seems to me that whatever function describes the different regimes of the eras is somewhat continuous on large enough scales. So methinks: nice, let’s think more. When I look at the regimes in the validation set (naughty, I know), they seem to connect seamlessly to the last eras of the training set, almost as if they are directly connected in time. Does anybody know if that might be true?

To qualify what I mean by regimes of eras: when you compute the covariance matrix of all features per era, you get a bunch of rather beautiful 310x310 pieces of pixel art. If you use a matrix norm as a metric, those pictures vary pretty continuously over the training and validation sets.
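In case anyone wants to poke at the same thing, here is roughly what I did as a minimal sketch, assuming the classic dataset with its ~310 `feature*` columns and an `era` column (the file name is an assumption on my part):

```python
import numpy as np
import pandas as pd

# Path and column names are assumptions based on the classic dataset.
df = pd.read_csv("numerai_training_data.csv")
feature_cols = [c for c in df.columns if c.startswith("feature")]

# One 310x310 covariance matrix per era -- the "pixel art".
cov_by_era = {
    era: group[feature_cols].cov().to_numpy()
    for era, group in df.groupby("era")
}

# Eras are strings like "era1", so sort them numerically.
eras = sorted(cov_by_era, key=lambda e: int(str(e).lstrip("era")))

# Frobenius norm of consecutive differences as a crude continuity
# metric: small values mean the regime changes smoothly era to era.
dists = [
    np.linalg.norm(cov_by_era[a] - cov_by_era[b], ord="fro")
    for a, b in zip(eras, eras[1:])
]
```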

Curious to hear what you think.

Yep. Training data eras 1-120 represent 10 years (1 era = 1 month), so val eras 121-132 are simply the following year. Then there’s a large gap to the other validation eras, 197-212, which are from fairly recent times.

I see, that makes a lot of sense. Thanks! Is this documented somewhere that I missed?

Unfortunately most stuff like this isn’t documented. You can pick this kind of thing up from Arbitrage’s office hours, though. I just watched every video on the Numerai YouTube channel; it takes some time, but I’d recommend it. Just watch them while your models are training :slight_smile:

Here’s what I’ve found (take with a grain of salt):

If we assume that era1 is January 2003, then the monthly data we get is Jan 03 - Dec 12 in train, and Jan 13 - Dec 13 plus Jun 19 - Sep 20 in val. When I used this month variable in my models, it consistently improved performance by 5-10%.
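As a quick illustration, here is a tiny helper built on that era-1-is-January-2003 assumption (the function itself is mine, nothing official):

```python
from datetime import date

def era_to_month(era: int, start_year: int = 2003) -> date:
    """First day of the calendar month for a monthly era,
    assuming era 1 = January 2003 (an unconfirmed assumption)."""
    months = era - 1
    return date(start_year + months // 12, months % 12 + 1, 1)

# era_to_month(1)   -> 2003-01-01
# era_to_month(120) -> 2012-12-01 (end of train)
# era_to_month(121) -> 2013-01-01 (start of val)
# The month feature I used was then roughly:
# df["month"] = df["era"].str.lstrip("era").astype(int).map(
#     lambda e: era_to_month(e).month)
```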
But what about weeks? If the numbering is consistent, we should know exactly when each weekly era starts and ends, since live eras become test eras after completion. If this is true, era575 is week 2 of 2014 (starting January 9, ending February 5), and so on.
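Again purely as a sketch: if era575 really does start on January 9, 2014, and weekly eras advance by exactly one week (both unverified assumptions), the start dates would follow directly:

```python
from datetime import date, timedelta

def weekly_era_start(era: int) -> date:
    """Start date of a weekly era, anchored on the (unverified)
    assumption that era 575 starts 2014-01-09 and each era
    advances by exactly one week."""
    return date(2014, 1, 9) + timedelta(weeks=era - 575)

# weekly_era_start(575) -> 2014-01-09
# weekly_era_start(576) -> 2014-01-16
```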
Unfortunately, I never managed to bring these findings together, and my models that included month as a variable performed poorly on live. I may have made a mistake, Numerai’s numbering might be inconsistent, or, since weeks often overlap months, the performance boost on monthly data may simply not carry over to weekly data.

In any case, I think Numerai should provide more information to let us try to find some temporal relations, and give us weekly data to train and validate on.

I quite agree, at least in some ways. FWIW, I really like your idea of turning the covariance matrices into pixel art. Maybe Numerai should produce NFTs from the covariance matrices for each live era (a big one for the overall round winner, smaller ones from the matrices of combinations of feature groups) and award them to the high scorers? I digress.

Anyway, I’m personally intrigued by how each era relates to the others, so I’ve been exploring that over in the Analyzing Training Data thread. If you look at the evolution of the averaged std for the feature groups Charisma and Strength, from the first training set through to the live round, there’s an interesting trend.
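If you want to reproduce that, something along these lines is all it takes, assuming the classic column naming like `feature_charisma1` (the file path is an assumption):

```python
import pandas as pd

df = pd.read_csv("numerai_training_data.csv")  # path is an assumption

def group_std_by_era(df: pd.DataFrame, group: str) -> pd.Series:
    """Per-era standard deviation of each feature in a group,
    averaged across that group's features."""
    cols = [c for c in df.columns if c.startswith(f"feature_{group}")]
    return df.groupby("era")[cols].std().mean(axis=1)

charisma = group_std_by_era(df, "charisma")
strength = group_std_by_era(df, "strength")
```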

Yes, I should definitely listen to the podcast more. Again, I am new and there is so much left for me to explore.
I think numbering months and weeks is somewhat problematic, since the era length is not constant at all. There might be ways to mitigate that, though.

Here is a pic of the “average covariance” of the training data I came up with, btw.

A 4-4-5 calendar may be useful for aligning week and month columns.
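For example, a sketch of a 4-4-5 mapping from a 1-based week-of-year to a month (real fiscal calendars shuffle the 4s and 5s differently, so treat this variant as an assumption):

```python
def week_to_445_month(week: int) -> int:
    """Map week-of-year (1..52) to month (1..12) under a 4-4-5
    calendar: each 13-week quarter splits into months of
    4, 4, and 5 weeks."""
    quarter, week_in_quarter = divmod(week - 1, 13)
    if week_in_quarter < 4:        # first month of the quarter
        month_in_quarter = 0
    elif week_in_quarter < 8:      # second month
        month_in_quarter = 1
    else:                          # five-week third month
        month_in_quarter = 2
    return quarter * 3 + month_in_quarter + 1
```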

Interesting, thanks :slight_smile: . I’d really like to find a way to add a time variable to the mix, since it so clearly works in the training/val data.