New Data Format - Still not enough data


When I read that the new tournament format would include a new “time information” column, I thought: “Finally, now they include a timestamp”. Sadly, that is not the case. Segmenting the data into undefined “era” categories with no further explanation is not what I hoped for.

The thing is, the Numerai data is clearly some kind of stock data, and stock data is almost always a time series with autocorrelation. Without the time ordering, we are missing a significant part of the information that could help us train better models, and in my opinion that part is critical.

As asked before, is era a proxy for a specific time frame? Are the eras in a specific order?


A big focus of Numerai is making our data such that 1. you don’t know what’s going on with it 2. you don’t have to know what’s going on with it to be successful. A lot of people who do finance have (incorrect) a priori knowledge about what variables are good to use and what techniques are “sound”. By not giving users all the details about everything, we are trying to avoid these pitfalls.


I get why it is “good” for us not to know what the features represent. But on the other hand, I think in the specific case of time series, relevant data is lost or can’t be used in a meaningful way, because we don’t know whether a feature represents the time information (e.g. to use sliding windows on data sets, detect concept drift, etc.). The introduction of the “era” field does not change that. The time information would also be helpful when doing cross-validation to prevent look-ahead bias. That being said, I’m no expert, but I would like to understand why exactly this information is kept from us :slight_smile:
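
To make the cross-validation point concrete, here is a minimal sketch of a walk-forward split that keeps whole eras together and only ever tests on eras that come after the training eras. It rests on an assumption that has not been confirmed in this thread: that eras appear in the file in chronological order. The function name `era_walk_forward_splits` and the `n_folds` parameter are made up for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def era_walk_forward_splits(
    eras: Sequence[str], n_folds: int = 3
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) row-index pairs, one pair per fold.

    Rows of the same era are never split between train and test, and
    every training era comes before every test era.
    """
    # ASSUMPTION (unconfirmed): the order in which eras first appear in
    # the data is their chronological order.
    ordered = list(dict.fromkeys(eras))  # unique eras, first-seen order

    if n_folds < 1 or len(ordered) < n_folds + 1:
        raise ValueError("need at least n_folds + 1 distinct eras")

    # Cut the era sequence into n_folds + 1 consecutive blocks; the last
    # block also absorbs any leftover eras.
    block = len(ordered) // (n_folds + 1)

    for k in range(1, n_folds + 1):
        start = k * block
        end = (k + 1) * block if k < n_folds else len(ordered)
        train_eras = set(ordered[:start])    # everything before the test block
        test_eras = set(ordered[start:end])  # the next block of eras

        train_idx = [i for i, e in enumerate(eras) if e in train_eras]
        test_idx = [i for i, e in enumerate(eras) if e in test_eras]
        yield train_idx, test_idx
```

If the eras are not ordered, a split like this is no better than a plain grouped split by era, which is exactly why the answer to the ordering question matters.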