I found some nice advice from a Kaggle grandmaster that can easily be adapted to the Numerai competition. The original version can be found here: https://www.kaggle.com/c/home-credit-default-risk/discussion/58332. And here is my adaptation for Numerai:
- Since we’re provided historical data, this is partly a time-series problem, which means that recent data is more relevant than old data.
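One way to act on that point is to weight recent eras more heavily when training. Here is a minimal sketch using exponential decay; the `half_life` value and the era numbering are assumptions for illustration, not part of the original advice:

```python
import numpy as np

def era_recency_weights(era_numbers, half_life=50):
    """Give recent eras more training weight via exponential decay.

    era_numbers: integer era index per row (higher = more recent).
    half_life: number of eras over which a sample's weight halves
               (a hypothetical choice; tune it on your own CV).
    """
    era_numbers = np.asarray(era_numbers, dtype=float)
    age = era_numbers.max() - era_numbers  # 0 for the newest era
    return 0.5 ** (age / half_life)

# The newest era gets weight 1.0; an era `half_life` eras older gets 0.5.
weights = era_recency_weights([1, 51, 101], half_life=50)
```

Most boosting libraries accept such weights through a `sample_weight`-style argument at fit time.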
- There are many different regimes across the eras, which means there is a lot of variance between folds. Try different K-fold splits to see whether your model is stable, and treat the validation score as just one more fold. It could be an outlier, so TRUST YOUR LOCAL CV!!!
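A common way to build such folds for Numerai is to group by era, so whole eras move between folds and you see the regime-to-regime variance directly. A minimal sketch with scikit-learn's `GroupKFold` (the data shapes and era count below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the Numerai training data: 300 rows across 15 eras.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.normal(size=300)
eras = np.repeat(np.arange(15), 20)  # era label per row

# Grouping by era keeps all rows of an era in the same fold, so no era
# leaks between a split's train and validation sides.
splits = list(GroupKFold(n_splits=5).split(X, y, groups=eras))
for train_idx, val_idx in splits:
    assert set(eras[train_idx]).isdisjoint(eras[val_idx])
```

Fit your model inside the loop and look at the spread of the per-fold scores, not just their mean: a wide spread is exactly the fold-to-fold variance the bullet above warns about.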
- Many of the features we’re given, and many that we generate, are not relevant to the target and just confuse the model. LGB and XGB have a rich toolset to remove noisy features and regularize your models. Two of the most important parameters for this competition are `feature_fraction` and `reg_lambda`.
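To make those two knobs concrete, here is a hedged example of a LightGBM parameter dict; the specific values are starting-point guesses, not tuned recommendations:

```python
# Hypothetical LightGBM parameters emphasizing the two knobs above.
lgb_params = {
    "objective": "regression",
    "feature_fraction": 0.1,  # sample only 10% of features per tree:
                              # strong protection against noisy features
    "reg_lambda": 5.0,        # L2 regularization on leaf weights
                              # (native LightGBM alias: lambda_l2)
    "num_leaves": 31,
    "learning_rate": 0.01,
}
```

Pass the dict to `lgb.train(lgb_params, train_set)` or expand it into the sklearn wrapper as `LGBMRegressor(**lgb_params)`.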
- As in all Kaggle competitions (and all machine learning problems, for that matter), the most important first step is to get a validation setup that matches the test set. There’s no point in spending time on feature engineering before your validation system is trustworthy.
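Since the live data lies in the future relative to training, one simple way to mimic the test set is to hold out the most recent eras as validation. A sketch, with made-up era counts and cutoff:

```python
import numpy as np

# Toy era labels: 120 eras, 10 rows each (numbers are hypothetical).
eras = np.repeat(np.arange(120), 10)

# Hold out the last 20 eras; they play the role of the future test set.
cutoff = eras.max() - 20
train_mask = eras <= cutoff
val_mask = ~train_mask
```

A split like this respects the time ordering from the first bullet: every validation era comes strictly after every training era.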
- Have fun!