Feature selection by Marcos Lopez de Prado

How do you decide which features to drop? Just the “worst” one, and then start over? Or all negative ones after one loop? Do you check for some sort of significance (possibly corrected for multiple hypothesis testing)? Otherwise, I guess you’ll always find negative results just by chance.

Good point! I’ve used the second version so far. I’ll try the 3rd as well!

thx for the article.
Just a question: what would be a good contender for “num.numerai_score”?
Is there a line of code/github-reference I could use?

When you download the example data, there is a file analysis_and_tips.ipynb. It contains a numerai_score method if I remember correctly. You might have to adapt it a bit to fit into the code of nyuton.

I simply order the features by importance and then, for example, drop the lowest 30%. Very straightforward.
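A minimal sketch of that "drop the lowest 30%" step, assuming the importances are already available as a dict from feature name to score. The function name `drop_lowest_fraction` and the dict layout are made up for illustration, not taken from anyone's code in this thread.

```python
def drop_lowest_fraction(importances, fraction=0.3):
    """Return the feature names kept after dropping the lowest-ranked fraction.

    importances: dict mapping feature name -> importance score (hypothetical layout).
    """
    # Highest importance first; ties keep their original order.
    ranked = sorted(importances, key=importances.get, reverse=True)
    n_drop = int(len(ranked) * fraction)
    return ranked[: len(ranked) - n_drop]
```

With 10 features and `fraction=0.3`, the 3 lowest-scoring features are dropped and the other 7 are returned in descending order of importance.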

It’s just the standard correlation score.
You can get the code here: GitHub - nemethpeti/numerai
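For readers who just want the idea without opening the repo, here is a rough sketch of a correlation score. It assumes that "standard correlation score" means the Pearson correlation between the target and the percentile-ranked predictions, and it leaves out any per-era grouping. The names `rank_pct` and `correlation_score` are placeholders, and this is not the code from the linked repository or from the example notebook.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_pct(values):
    """Percentile rank in (0, 1]; ties are broken by original position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position / len(values)
    return ranks


def correlation_score(y_true, y_pred):
    """Assumed scoring rule: correlate the target with ranked predictions."""
    return pearson(y_true, rank_pct(y_pred))
```

Because only the rank of each prediction matters here, any monotone rescaling of the predictions leaves the score unchanged.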

Have you compared the results of permutation importance with xgboost’s own reporting of feature importance, i.e. feature_importances_ or get_booster().get_score(importance_type=)? I’m wondering if there’s a correlation with any of XGB’s importance types.

There should be some correlation, but it’s not the same thing.

Thank you for sharing!
I am a huge believer in peer review, which is why I would like to share my take on your feature selection.

I’m currently using the following algorithm, which was derived from the chat feedback:

* split the data into 5 sequential CV folds
* for each split:
   o find optimal parameters for a model using another 5-fold CV
   o train the model with the found parameters on all features and measure the "base correlation" with the target
   o mark all features as non-selected
   o for each non-selected feature, repeat X (X=5) times:
       * shuffle the feature's column and measure the correlation with the target again
       * if that correlation is greater than the "base correlation", the feature fails the test: break out of the repeat
   o mark all features that pass every repetition as selected
* return the selected features

The idea is that CV should prevent overfitting to validation data and the repeated test should mitigate the effects of chance.

IMHO (not based on any research or math):

  • If at least one split of the CV uses a feature, then the feature should be included.
  • If the test fails at least once (the correlation is higher), the model does not rely on the feature.
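The steps above can be sketched roughly as follows. This is a simplified, model-agnostic sketch in plain Python, not the poster's actual code. The inner hyperparameter search is left out, `fit` is a hypothetical callable that takes training rows and targets and returns a `predict` function, rows are plain lists, and all function names are invented for illustration. Features are pooled across splits, so a feature that passes in any split is kept.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def with_shuffled_column(rows, col, rng):
    """Copy of rows with the values of one column randomly permuted."""
    values = [row[col] for row in rows]
    rng.shuffle(values)
    return [row[:col] + [v] + row[col + 1:] for row, v in zip(rows, values)]


def passing_features(predict, X_val, y_val, n_repeats=5, seed=0):
    """Column indices that pass the repeated shuffle test on one split."""
    rng = random.Random(seed)
    base = pearson(y_val, predict(X_val))  # the "base correlation"
    passed = []
    for col in range(len(X_val[0])):
        for _ in range(n_repeats):
            shuffled = with_shuffled_column(X_val, col, rng)
            if pearson(y_val, predict(shuffled)) > base:
                break  # shuffling improved the score: the feature fails
        else:
            passed.append(col)  # never beat the base score in n_repeats tries
    return passed


def select_features(fit, X, y, n_folds=5, n_repeats=5):
    """Union of passing features over sequential (unshuffled) folds."""
    fold_size = len(X) // n_folds
    selected = set()
    for k in range(n_folds):
        start = k * fold_size
        stop = len(X) if k == n_folds - 1 else start + fold_size
        X_train, y_train = X[:start] + X[stop:], y[:start] + y[stop:]
        predict = fit(X_train, y_train)  # hypothetical: returns a predict function
        selected.update(
            passing_features(predict, X[start:stop], y[start:stop], n_repeats)
        )
    return sorted(selected)
```

One thing the sketch makes visible: with a strict "greater than" comparison, a feature the model ignores completely leaves the score unchanged when shuffled, so it passes the test. Using "greater than or equal" would drop such features, but that is a different rule from the one described above.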

Aside from that, I’ve tried a non-iterative approach, particle swarm optimization (PSO), to make feature selection faster. It did not work well. The computation was a little bit faster, but the performance was much worse. Has anybody had a better experience?

Thank you for any feedback!

thanks m8, going to give it a try!

Noob question: how can you implement this before doing a hyperparameter sweep?

If you do a model.predict, then your hyperparameters should already be defined, correct?

Good spot. I’m just about to get cracking with


hi nyuton,

looking to try this with the new data set coming.

have your live results backed up the experimental results?


If you liked this post and would like to buy models that actually perform well, you can do it now at NumerBay.ai!
Two of my models are available here.

are you able to add this to the pickle or are you running and uploading manually or via webhook?