Advice from Kaggle which I've found very useful

Hi everyone!

I’ve found some nice advice from a Kaggle grandmaster which can easily be adapted to the Numerai competition. The original version can be found here: https://www.kaggle.com/c/home-credit-default-risk/discussion/58332. And here is my adaptation for Numerai:

  1. Since we’re provided historic data, this is partly a time-series problem. This means that recent data is more relevant than old data.
  2. There are a lot of different regimes between the eras, which means that there’s a lot of variance between folds. Try different K-fold sets to see if your model is stable, and interpret the validation score as just one more fold. It could be an outlier, so TRUST YOUR LOCAL CV!!!
  3. Many of the features we’re given, and many that we generate, are not relevant to the target and just confuse the model. LGB and XGB have a rich toolset to remove noisy features and regularize your models. Two of the most important for this competition are feature_fraction and reg_lambda.
  4. As in all Kaggle competitions (and all machine learning problems, for that matter), the most important first step is to get a validation set-up that matches the test set. There’s no point in spending time on feature-engineering before your validation system is trustworthy.
  5. Have fun!
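
To make point 2 a bit more concrete, here is a minimal sketch of trying different K-fold sets and looking at the spread of the fold scores. Keeping whole eras together in a fold is my own assumption, not something the original advice spells out. The names era_kfold, cv_stability and fit_and_score are made up for illustration, the data is random toy data, and the "model" is only a placeholder that predicts the training mean.

```python
import random
from statistics import mean, pstdev


def era_kfold(eras, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs, holding out whole eras together."""
    unique_eras = sorted(set(eras))
    random.Random(seed).shuffle(unique_eras)
    # Deal the shuffled eras round-robin into n_splits groups.
    fold_eras = [set(unique_eras[i::n_splits]) for i in range(n_splits)]
    for held_out in fold_eras:
        train_idx = [i for i, era in enumerate(eras) if era not in held_out]
        valid_idx = [i for i, era in enumerate(eras) if era in held_out]
        yield train_idx, valid_idx


def cv_stability(eras, fit_and_score, n_splits=5, seeds=(0, 1, 2)):
    """Repeat era-wise K-fold with several seeds and summarize the fold scores."""
    scores = [
        fit_and_score(train_idx, valid_idx)
        for seed in seeds
        for train_idx, valid_idx in era_kfold(eras, n_splits, seed)
    ]
    return mean(scores), pstdev(scores), scores


# Toy data: 20 eras with 10 rows each and a random target.
eras = [f"era{i // 10}" for i in range(200)]
rng = random.Random(42)
target = [rng.random() for _ in eras]


def fit_and_score(train_idx, valid_idx):
    # Placeholder "model": predict the training mean, score by negative MSE.
    pred = mean([target[i] for i in train_idx])
    return -mean([(target[i] - pred) ** 2 for i in valid_idx])


avg, spread, scores = cv_stability(eras, fit_and_score)
print(f"mean={avg:.4f} std={spread:.4f} over {len(scores)} folds")
```

If the spread across folds and seeds is large compared with the gap between two candidate models, a single validation score should not decide between them, which is the sense in which it is "just one more fold".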

Regards,
Mark

@jackerparker what do you mean here by “get a validation set-up that matches the test set”?

“Two of the most important for this competition are feature_fraction and reg_lambda” - good suggestion. Especially the first one.
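
For anyone who hasn't used them, here is a rough sketch of where these two parameters go. The values are placeholders, not tuned recommendations, and the dict names are only for illustration. In LightGBM, feature_fraction is the share of features randomly sampled for each tree and reg_lambda is the L2 penalty; XGBoost calls the column-sampling parameter colsample_bytree.

```python
# Placeholder values for illustration only; tune them against your own CV.
lgb_params = {
    "feature_fraction": 0.1,  # each tree sees a random 10% of the features
    "reg_lambda": 1.0,        # L2 regularization strength
}

# XGBoost uses a different name for per-tree column sampling.
xgb_params = {
    "colsample_bytree": 0.1,  # same idea as LightGBM's feature_fraction
    "reg_lambda": 1.0,        # L2 regularization strength
}
```

These dicts would be passed to the model constructors, for example lightgbm.LGBMRegressor(**lgb_params) or xgboost.XGBRegressor(**xgb_params).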