Advice from Kaggle which I've found very useful

jackerparker · May 1, 2020, 10:11am

Hi everyone!

I’ve found some nice advice from a Kaggle grandmaster that can easily be adapted to the Numerai competition. The original version can be found here: https://www.kaggle.com/c/home-credit-default-risk/discussion/58332. And here is my adaptation for Numerai:

Since we’re provided historic data, this is partly a time-series problem. This means that recent data is more relevant than old data.
There are many different regimes across the eras, which means that there’s a lot of variance between folds. Try different K-fold sets to see if your model is stable, and interpret the validation score as just one more fold. It could be an outlier, so TRUST YOUR LOCAL CV!!!
Many of the features we’re given and that we generate are not relevant to the target and just confuse the model. LGB and XGB have a rich toolset to remove noisy features and regularize your models. Two of the most important for this competition are feature_fraction and reg_lambda.
As in all Kaggle competitions (and all machine learning problems, for that matter), the most important first step is to get a validation set-up that matches the test set. There’s no point in spending time on feature-engineering before your validation system is trustworthy.
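To make the K-fold point concrete, here is a minimal sketch of era-wise folds, assuming each row carries an era label. The function names `era_kfold` and `summarize` and the toy era labels are made up for illustration; they are not part of any Numerai tooling. Changing `seed` gives a different K-fold set, which is one way to check whether the scores are stable.

```python
import random
import statistics


def era_kfold(eras, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs where no era is split across both sides.

    `eras` is a list with one era label per row (hypothetical input format).
    """
    unique_eras = sorted(set(eras))
    rng = random.Random(seed)
    rng.shuffle(unique_eras)
    # Deal the shuffled eras round-robin into n_splits groups.
    groups = [set(unique_eras[i::n_splits]) for i in range(n_splits)]
    for held_out in groups:
        train_idx = [i for i, e in enumerate(eras) if e not in held_out]
        valid_idx = [i for i, e in enumerate(eras) if e in held_out]
        yield train_idx, valid_idx


def summarize(scores):
    """Return (mean, standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    # Toy data: 6 eras with 3 rows each.
    toy_eras = [f"era{k}" for k in range(1, 7) for _ in range(3)]
    for seed in (0, 1, 2):  # different K-fold sets
        for train_idx, valid_idx in era_kfold(toy_eras, n_splits=3, seed=seed):
            held_out = sorted({toy_eras[i] for i in valid_idx})
            print(seed, held_out, len(train_idx), len(valid_idx))
```

Note that shuffling eras ignores their time order. If you want to respect the time-series aspect, you could replace the shuffle with contiguous blocks of eras. Either way, a large spread from `summarize` across folds suggests that a single validation score should not be trusted on its own.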
Have fun!

Regards,
Mark

chelnak · May 19, 2021, 8:10pm

@jackerparker what do you mean here by “get a validation set-up that matches the test set”?

autratec · June 14, 2021, 12:45am

“Two of the most important for this competition are feature_fraction and reg_lambda” - good suggestion. Especially the first one.
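For anyone looking for where these two settings live, here is a rough sketch as plain parameter dictionaries. In LightGBM, feature_fraction is also accepted as colsample_bytree, and reg_lambda is an alias of lambda_l2; in XGBoost, the column-sampling counterpart is colsample_bytree, and reg_lambda is an alias of lambda. The values below are illustrative placeholders, not tuned settings from this thread.

```python
# Illustrative values only; assumptions for the sketch, not tuned for Numerai.
lgb_params = {
    "feature_fraction": 0.1,  # fraction of features sampled for each tree
    "reg_lambda": 1.0,        # L2 regularization on leaf weights (alias of lambda_l2)
}

xgb_params = {
    "colsample_bytree": 0.1,  # XGBoost counterpart of feature_fraction
    "reg_lambda": 1.0,        # L2 regularization on leaf weights (alias of lambda)
}
```

A lower feature fraction means each tree sees a smaller random subset of the features, which limits how much any single noisy feature can dominate the model.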
