Feature selection by Marcos Lopez de Prado

How do you decide which features to drop? Just the “worst” one, and then start over? Or all negative ones after one pass? Do you check for some sort of significance (possibly corrected for multiple hypothesis testing)? Otherwise, I guess you’ll always find negative results just by chance.

Good point! I’ve used the second version so far. I’ll try the 3rd as well!

Hi,
thanks for the article.
Just a question: what would be a good contender for “num.numerai_score”?
Is there a line of code or a GitHub reference I could use?

When you download the example data, there is a file called analysis_and_tips.ipynb. It contains a numerai_score method, if I remember correctly. You might have to adapt it a bit to fit into nyuton’s code.

I simply order the features by importance and then, for example, drop the lowest 30%. Very straightforward.
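
If it helps, here is a minimal sketch of that kind of cut-off (assuming a fitted scikit-learn-style model exposing feature_importances_ and a pandas DataFrame X of features; the names are just placeholders):

```python
import pandas as pd

def keep_top_features(model, X: pd.DataFrame, drop_frac: float = 0.30) -> list:
    """Rank features by the model's importance and drop the lowest drop_frac of them."""
    importances = pd.Series(model.feature_importances_, index=X.columns)
    ranked = importances.sort_values(ascending=False)
    n_keep = int(len(ranked) * (1 - drop_frac))
    return ranked.index[:n_keep].tolist()

# usage sketch: selected = keep_top_features(model, X); model.fit(X[selected], y)
```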

It’s just the standard correlation score.
You can get the code here: GitHub - nemethpeti/numerai
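
In case the notebook is hard to track down: the score is essentially the per-era correlation between rank-transformed predictions and the target. A rough sketch (assuming a DataFrame with an era column plus prediction and target series; the exact implementation in the notebook or repo may differ):

```python
import numpy as np
import pandas as pd

def numerai_score(target: pd.Series, pred: pd.Series, eras: pd.Series) -> float:
    """Mean per-era correlation between rank-transformed predictions and the target."""
    df = pd.DataFrame({"target": target, "pred": pred, "era": eras})

    def era_corr(group: pd.DataFrame) -> float:
        # rank predictions to [0, 1] within the era, then correlate with the target
        ranked = group["pred"].rank(pct=True, method="first")
        return np.corrcoef(ranked, group["target"])[0, 1]

    return df.groupby("era").apply(era_corr).mean()
```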

Have you compared the results of permutation importance with XGBoost’s own reporting of feature importance? i.e.
feature_importances_ or get_booster().get_score(importance_type=). I’m wondering if there’s a correlation with any of XGBoost’s importance types.

There should be some correlation, but it’s not the same thing.
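
If you want to check the overlap yourself, something along these lines should work (a sketch assuming a fitted XGBRegressor called model trained on a DataFrame, plus held-out data X_val, y_val; which importance types you compare is up to you):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# permutation importance measured on held-out data
perm = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=X_val.columns)

# XGBoost's built-in importance for a given importance_type ("weight", "gain", "cover", ...)
gain_imp = pd.Series(model.get_booster().get_score(importance_type="gain"))

# rank correlation between the two rankings (features absent from get_score were never used)
both = pd.concat([perm_imp, gain_imp], axis=1, keys=["perm", "gain"]).fillna(0.0)
print(both.corr(method="spearman"))
```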

Thank you for sharing!
I am a huge believer in peer review, which is why I would like to share my take on your feature selection.

I’m currently using the following algorithm, which was derived from the chat feedback:

* Split the data into 5 sequential CV folds.
* For each split:
  * Find optimal parameters for a model using another 5-fold CV.
  * Train the model with the found parameters on all features and measure the "base correlation" with the target.
  * Mark all features as non-selected.
  * For each non-selected feature, repeat X (X=5) times:
    * Shuffle the feature's column and measure the correlation with the target.
    * If the correlation is greater than the "base correlation", break out of the repeats.
  * All features that pass the repeated test are marked as selected.
* Return the selected features.

The idea is that CV should prevent overfitting to validation data and the repeated test should mitigate the effects of chance.

IMHO (not based on any research or math):

  • If at least one split of the CV uses a feature, then the feature should be included.
  • If the test fails at least once (the shuffled correlation is higher), the model does not rely on the feature.
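
For concreteness, here is a rough sketch of how that loop could look in code (the inner hyper-parameter search is omitted, and make_model and score are placeholders for whatever model factory and scoring function you use; the real code differs in details):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def select_features(X: pd.DataFrame, y: pd.Series, make_model, score,
                    n_splits=5, n_repeats=5, seed=0):
    """Keep a feature if, in at least one CV split, shuffling it never beats the base score."""
    rng = np.random.default_rng(seed)
    selected = set()
    # sequential (non-shuffled) folds, as in the description above
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

        # in the real version the parameters come from an inner 5-fold CV search
        model = make_model()
        model.fit(X_tr, y_tr)
        base = score(y_val, model.predict(X_val))

        for feat in X.columns:
            if feat in selected:          # already accepted by an earlier split
                continue
            passed = True
            for _ in range(n_repeats):
                X_shuf = X_val.copy()
                X_shuf[feat] = rng.permutation(X_shuf[feat].to_numpy())
                if score(y_val, model.predict(X_shuf)) > base:
                    passed = False        # shuffling helped, so the model does not rely on it
                    break
            if passed:
                selected.add(feat)
    return sorted(selected)
```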

Aside from that, I’ve tried a non-iterative approach, particle swarm optimization (PSO), to make feature selection faster. It did not work well: the computation was a little bit faster, but the performance was much worse. Has anybody had a better experience?

Thank you for any feedback!

This is a very helpful post that you started, and very generous.
All my models have improved as a very direct result.

You likely have seen it, but if not, I found the scikit-learn article titled
“Permutation feature importance” (link below) to be insightful.

Two notes from the article: 1. The MDA method requires a good model; in other words, a bad model will discard good features (maybe obvious, but good to remember). 2. “When two features are correlated and one of the features is permuted, the model will still have access to the feature through its correlated feature. This will result in a lower importance value for both features, where they might actually be important.” This was not obvious, at least not to me.

And this second point cannot be overlooked. sklearn says: “One way to handle this is to cluster features that are correlated and only keep one feature from each cluster.” The question is how to make that choice, and, if you compare MDA runs, what to do when different runs discard different features. That requires a good deal of training of the model, with differing results…

I stumbled on another approach when I discovered that a whole group of what I believed were top features were the first ones discarded by MDA. I first finish the MDA run, getting down to the MDA subset of optimal features, and then add back, one at a time, each feature from the highly correlated groups to the optimal set of MDA features, retraining and testing to find the new optimal set. Not only is there a nice bump in corr when an important feature is reinstated, but I have also found that sometimes one of the previously (early) discarded highly correlated features jumps to the top of the ranked feature importance and MDA importance.
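
For reference, the clustering approach suggested in the sklearn docs looks roughly like this (a condensed sketch based on their example; the distance threshold is a knob you would have to tune, and which feature you keep from each cluster is an arbitrary choice here):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def one_feature_per_cluster(X, threshold=1.0):
    """Cluster features by Spearman correlation and keep one feature per cluster."""
    corr = spearmanr(X).correlation
    corr = (corr + corr.T) / 2           # make the matrix exactly symmetric
    np.fill_diagonal(corr, 1.0)
    distance = 1 - np.abs(corr)          # highly correlated features -> small distance
    linkage = hierarchy.ward(squareform(distance, checks=False))
    cluster_ids = hierarchy.fcluster(linkage, threshold, criterion="distance")
    keep = {}
    for col, cid in enumerate(cluster_ids):
        keep.setdefault(cid, col)        # keep the first column seen in each cluster
    return sorted(keep.values())         # indices of the columns to keep
```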

By the way, I also run MDA to optimize for Sharpe; early numbers show some promise.

Here’s the link: 4.2. Permutation feature importance — scikit-learn 0.24.2 documentation

And I only found out about this helpful article when *Minou posted about it above…
also very generous of him.
