I’m new to Signals and am developing a model. I notice the train data is all quite old, up to about 2012, with validation being data since 2012. Is there a reason for this? Ideally I’d like to train on more recent data due to the type of signal I’m developing. Maybe I’m missing something here?
Unlike the main competition, you don't have to use the data in the historical file. The value you submit (a continuous value centered around 0.5) should reflect the expected return between day 2 and day 6 after the Friday date, and you can use any data you like to come up with a signal. If you plan to derive your signal from traditional OHLC/OHLCV market data, you might use a data source such as yfinance, and train/validate over any segments of the data you wish. Submitting values derived directly from some data without involving a model could also be effective, e.g. if you had a source of short-term sentiment data, simply centering and scaling that might suffice.
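To make the "centering and scaling" idea concrete, here is a minimal sketch. It uses rank-based scaling, which is just one way to do it and is my own choice, not something prescribed above. The tickers and sentiment scores are made up, `center_and_scale` is a hypothetical helper name, and the assumption that submitted values should sit strictly between 0 and 1 is worth checking against the Signals docs.

```python
def center_and_scale(raw_scores):
    """Map raw scores to (0, 1) by rank, so the result is centered on 0.5."""
    tickers = list(raw_scores)
    n = len(tickers)
    # Sort tickers by raw score, lowest first.
    ordered = sorted(tickers, key=lambda t: raw_scores[t])
    signal = {}
    i = 0
    while i < n:
        # Group tied scores so they share the same average rank.
        j = i
        while j + 1 < n and raw_scores[ordered[j + 1]] == raw_scores[ordered[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in ordered[i:j + 1]:
            # Shift by half a rank so values never hit exactly 0 or 1.
            signal[t] = (avg_rank - 0.5) / n
        i = j + 1
    return signal


# Hypothetical sentiment scores keyed by ticker (made-up values).
sentiment = {"AAA": -1.2, "BBB": 0.3, "CCC": 2.5, "DDD": 0.3}
signal = center_and_scale(sentiment)
print(signal)
```

Ranking throws away the magnitude of the raw scores but makes the output robust to outliers; a z-score squashed through a sigmoid would be an alternative if you want to keep magnitudes.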
Thanks for that. I guess if I don’t use the published validation data I won’t get any metrics from Numerai, but that doesn’t matter if I’ve done my own train/test splitting from data I’ve got from yfinance or elsewhere. Or would I get some metrics?
Diagnostics are only produced if you submit signals for the validation data. Depending on preference, that info could be a useful way to compare models or merely an unnecessary complication. Unlike the main competition, validation takes a while to be produced for Signals (up to 15 mins mentioned in the docs IIRC), so that might sway you against using it. Purely personal preference though.
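If you go the do-it-yourself route for train/test splitting mentioned above, a date-based split could look something like this. It is only a sketch: the rows, field names and cutoff date are all made up for illustration, and `split_by_date` is a hypothetical helper. In practice the rows would come from whatever source you use (yfinance or elsewhere).

```python
from datetime import date


def split_by_date(rows, cutoff):
    """Rows dated before `cutoff` go to train, the rest go to test."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Made-up rows standing in for features you built yourself.
rows = [
    {"date": date(2019, 6, 7), "ticker": "AAA", "feature": 0.10},
    {"date": date(2020, 3, 6), "ticker": "AAA", "feature": 0.25},
    {"date": date(2021, 2, 5), "ticker": "AAA", "feature": -0.05},
    {"date": date(2022, 1, 7), "ticker": "AAA", "feature": 0.40},
]

# Hypothetical cutoff: train on everything before 2021, test on the rest.
train, test = split_by_date(rows, cutoff=date(2021, 1, 1))
print(len(train), len(test))
```

Splitting by date rather than at random keeps the test rows strictly later than the training rows, which avoids leaking future information into training.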