Numerapi v4 dataset

kayeffnumeraitor · September 5, 2022, 7:56pm

Try to optimize for correlation that is consistently above zero, rather than for maximum correlation. The signal is remarkably weak, so weak that I repeatedly questioned whether there is any signal at all. Random targets give a correlation in the ballpark of 0.00 +/- 0.01, while the real signal is somewhere around 0.03 +/- 0.02. It takes a lot of experimentation, but I can confirm there is a signal to optimize for. The Jerome 20-day target seems to be rather good for CORR, but for me it does not work so well for TC.
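
To make "consistently above zero" concrete, here is a minimal sketch that scores predictions era by era and reports how often the correlation is positive. It uses plain Pearson correlation as a stand-in for the official scoring, and the function names (`per_era_corr`, `consistency_summary`) are made up for illustration, not part of numerapi.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def per_era_corr(eras, preds, targets):
    """Group rows by era and compute one correlation per era."""
    grouped = defaultdict(lambda: ([], []))
    for era, p, t in zip(eras, preds, targets):
        grouped[era][0].append(p)
        grouped[era][1].append(t)
    return {era: pearson(p, t) for era, (p, t) in grouped.items()}


def consistency_summary(era_corrs):
    """Summarize how stable the per-era correlations are."""
    values = list(era_corrs.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {
        "mean": mean,
        "std": std,
        "share_positive": sum(v > 0 for v in values) / len(values),
        "worst": min(values),
    }
```

When comparing two models, I would look at `share_positive` and `worst` first and at `mean` second; a model with a slightly lower mean but fewer negative eras is usually the one I would prefer under the advice above.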

It also helps to make your model robust to any single feature. Say you create predictions with your model on the standard data, and again on data where one feature is replaced with random numbers. The two sets of predictions should be highly correlated (> 0.95); otherwise your model depends too much on that one feature. The reason is that Numerai drops most of the ~5000 assets you create predictions for, especially if the feature exposure is too high. Now imagine that the feature your model depends on so heavily is the very reason some stocks get excluded: all of your model's other predictions become basically trash.
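
A rough sketch of that check, under some assumptions: the model is any callable that maps a list of feature rows to a list of predictions, the features are uniform random numbers rather than real Numerai data, and the two toy models (`balanced_model`, `lopsided_model`) exist only to show a pass and a fail. The 0.95 threshold is the one mentioned above.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def randomized_feature_corr(predict, rows, col, seed=0):
    """Correlation between predictions on the original rows and on rows
    where feature `col` has been replaced with random numbers."""
    rng = random.Random(seed)
    baseline = predict(rows)
    perturbed = [row[:col] + [rng.random()] + row[col + 1:] for row in rows]
    return pearson(baseline, predict(perturbed))


def balanced_model(rows):
    # Toy model: every feature gets the same weight.
    return [sum(row) / len(row) for row in rows]


def lopsided_model(rows):
    # Toy model: 90% of the weight sits on feature 0.
    return [0.9 * row[0] + 0.1 * sum(row[1:]) / (len(row) - 1) for row in rows]


# Synthetic stand-in data: 1000 rows, 50 features.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(50)] for _ in range(1000)]

for name, model in [("balanced", balanced_model), ("lopsided", lopsided_model)]:
    # Check every feature and keep the worst (lowest) correlation.
    worst = min(randomized_feature_corr(model, rows, col) for col in range(50))
    verdict = "ok" if worst > 0.95 else "too dependent on a single feature"
    print(name, round(worst, 3), verdict)
```

With a real model you would pass its predict function and loop over all feature columns the same way; the feature with the lowest correlation is the one the model leans on most.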

Also, take the train/test split in the official data with a grain of salt. Each era in the v4 dataset is one week, so 4 eras are one month and 12*4 = 48 eras are one year. IIRC, the train set is around 500-600 eras and the test set is also around 500-600 eras, which is roughly 10 years of train data plus 10 years of test data. If you train your model on the train data only, you basically have a model from 2012 trying to predict stocks in 2022. What makes this even worse is that there are at least 5 features that have a high correlation with the target in the train set but zero correlation in the test set (see this post).
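
One way to look for such features yourself is to compare each feature's correlation with the target in both splits. The sketch below assumes the data is held as a dict of column name to list of values, and the thresholds `min_train` and `max_test` are arbitrary placeholders I picked for illustration, not values from Numerai.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def unstable_features(train, test, target="target", min_train=0.02, max_test=0.005):
    """Flag features that correlate with the target in `train` but not in `test`.

    `train` and `test` map column names to lists of values and must share
    the same columns. Returns (name, train_corr, test_corr) tuples.
    """
    flagged = []
    for name in train:
        if name == target:
            continue
        c_train = pearson(train[name], train[target])
        c_test = pearson(test[name], test[target])
        if abs(c_train) >= min_train and abs(c_test) <= max_test:
            flagged.append((name, c_train, c_test))
    return flagged
```

Features flagged this way are candidates to drop or neutralize before training; since the signal is so weak overall, it is probably worth confirming the decay per era before acting on a single pooled number.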

And even after considering all of that, while your CORR and FNC might improve, TC can still be consistently bad.