Feature selection by Marcos Lopez de Prado

How do you decide which features to drop? Just the “worst” one, and then start over? Or all negative ones after one pass? Do you check for some sort of significance (possibly corrected for multiple hypothesis testing)? Otherwise, I guess you’ll always find negative results just by chance.

Good point! I’ve used the second version so far. I’ll try the 3rd as well!

Hi,
thanks for the article.
Just a question: what would be a good contender for “num.numerai_score”?
Is there a line of code or a GitHub reference I could use?

When you download the example data, there is a file called analysis_and_tips.ipynb. It contains a numerai_score method, if I remember correctly. You might have to adapt it a bit to fit into nyuton’s code.

I simply order the features by importance and then, for example, drop the lowest 30%. Very straightforward.
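
If it helps, here is a minimal sketch of that kind of cut-off (assuming a fitted scikit-learn-style model exposing feature_importances_ and a pandas DataFrame X of features; the names are just placeholders):

```python
import pandas as pd

def keep_top_features(model, X: pd.DataFrame, drop_frac: float = 0.30) -> list:
    """Rank features by the model's importance and drop the lowest drop_frac of them."""
    importances = pd.Series(model.feature_importances_, index=X.columns)
    ranked = importances.sort_values(ascending=False)
    n_keep = int(len(ranked) * (1 - drop_frac))
    return ranked.index[:n_keep].tolist()

# usage sketch: selected = keep_top_features(model, X); model.fit(X[selected], y)
```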

It’s just the standard correlation score.
You can get the code here: GitHub - nemethpeti/numerai
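
In case the notebook is hard to track down: the score is essentially the per-era correlation between rank-transformed predictions and the target. A rough sketch (assuming a DataFrame with an era column plus prediction and target series; the exact implementation in the notebook or repo may differ):

```python
import numpy as np
import pandas as pd

def numerai_score(target: pd.Series, pred: pd.Series, eras: pd.Series) -> float:
    """Mean per-era correlation between rank-transformed predictions and the target."""
    df = pd.DataFrame({"target": target, "pred": pred, "era": eras})

    def era_corr(group: pd.DataFrame) -> float:
        # rank predictions to [0, 1] within the era, then correlate with the target
        ranked = group["pred"].rank(pct=True, method="first")
        return np.corrcoef(ranked, group["target"])[0, 1]

    return df.groupby("era").apply(era_corr).mean()
```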

Have you compared the results of permutation importance with XGBoost’s own reporting of feature importance? i.e.
feature_importances_ or get_booster().get_score(importance_type=). I’m wondering if there’s a correlation with any of XGBoost’s importance types.

There should be some correlation, but it’s not the same thing.
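
If you want to check the overlap yourself, something along these lines should work (a sketch assuming a fitted XGBRegressor called model trained on a DataFrame, plus held-out data X_val, y_val; which importance types you compare is up to you):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# permutation importance measured on held-out data
perm = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=X_val.columns)

# XGBoost's built-in importance for a given importance_type ("weight", "gain", "cover", ...)
gain_imp = pd.Series(model.get_booster().get_score(importance_type="gain"))

# rank correlation between the two rankings (features absent from get_score were never used)
both = pd.concat([perm_imp, gain_imp], axis=1, keys=["perm", "gain"]).fillna(0.0)
print(both.corr(method="spearman"))
```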

Thank you for sharing!
I am a huge believer in peer review, which is why I would like to share my take on your feature selection.

I’m currently using the following algorithm, which was derived from the chat feedback:

* Split the data into 5 sequential CV folds.
* For each split:
  * Find optimal parameters for a model using another 5-fold CV.
  * Train the model with the found parameters on all features and measure the "base correlation" with the target.
  * Mark all features as non-selected.
  * For each non-selected feature, repeat X (X=5) times:
    * Shuffle the feature's column and measure the correlation with the target.
    * If the correlation is greater than the "base correlation", break out of the repeats.
  * All features that pass the repeated test are marked as selected.
* Return the selected features.

The idea is that CV should prevent overfitting to validation data and the repeated test should mitigate the effects of chance.

IMHO (not based on any research or math):

  • If at least one split of the CV uses a feature, then the feature should be included.
  • If the test fails at least once (the shuffled correlation is higher), the model does not rely on the feature.
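
For concreteness, here is a rough sketch of how that loop could look in code (the inner hyper-parameter search is omitted, and make_model and score are placeholders for whatever model factory and scoring function you use; the real code differs in details):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def select_features(X: pd.DataFrame, y: pd.Series, make_model, score,
                    n_splits=5, n_repeats=5, seed=0):
    """Keep a feature if, in at least one CV split, shuffling it never beats the base score."""
    rng = np.random.default_rng(seed)
    selected = set()
    # sequential (non-shuffled) folds, as in the description above
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

        # in the real version the parameters come from an inner 5-fold CV search
        model = make_model()
        model.fit(X_tr, y_tr)
        base = score(y_val, model.predict(X_val))

        for feat in X.columns:
            if feat in selected:          # already accepted by an earlier split
                continue
            passed = True
            for _ in range(n_repeats):
                X_shuf = X_val.copy()
                X_shuf[feat] = rng.permutation(X_shuf[feat].to_numpy())
                if score(y_val, model.predict(X_shuf)) > base:
                    passed = False        # shuffling helped, so the model does not rely on it
                    break
            if passed:
                selected.add(feat)
    return sorted(selected)
```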

Aside from that, I’ve tried a non-iterative approach, particle swarm optimization (PSO), to make feature selection faster. It did not work well: the computation was a little bit faster, but the performance was much worse. Has anybody had a better experience?

Thank you for any feedback!

This is a very helpful post that you started, and very generous.
All my models have improved as a very direct result.

You likely have seen it, but if not, I found the scikit-learn article titled
“Permutation feature importance” (link below) to be insightful.

Two notes from the article: 1. The MDA method requires a good model; in other words, a bad model will discard good features (maybe obvious, but good to remember). 2. “When two features are correlated and one of the features is permuted, the model will still have access to the feature through its correlated feature. This will result in a lower importance value for both features, where they might actually be important.” This was not obvious, at least not to me.

And this second point cannot be overlooked. sklearn says: “One way to handle this is to cluster features that are correlated and only keep one feature from each cluster.” The question is how to make that choice, and, if you compare MDA runs, what to do when different runs discard different features. That requires a good deal of training of the model, with differing results…

I stumbled on another approach when I discovered that a whole group of what I believed were top features were the first ones discarded by MDA. I first finish the MDA run, getting down to the MDA subset of optimal features, and then add back, one at a time, each feature from the highly correlated groups to the optimal set of MDA features, retraining and testing to find the new optimal set. Not only is there a nice bump in corr when an important feature is reinstated, but I have also found that sometimes one of the previously (early) discarded highly correlated features jumps to the top of the ranked feature importance and MDA importance.
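
For reference, the clustering approach suggested in the sklearn docs looks roughly like this (a condensed sketch based on their example; the distance threshold is a knob you would have to tune, and which feature you keep from each cluster is an arbitrary choice here):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def one_feature_per_cluster(X, threshold=1.0):
    """Cluster features by Spearman correlation and keep one feature per cluster."""
    corr = spearmanr(X).correlation
    corr = (corr + corr.T) / 2           # make the matrix exactly symmetric
    np.fill_diagonal(corr, 1.0)
    distance = 1 - np.abs(corr)          # highly correlated features -> small distance
    linkage = hierarchy.ward(squareform(distance, checks=False))
    cluster_ids = hierarchy.fcluster(linkage, threshold, criterion="distance")
    keep = {}
    for col, cid in enumerate(cluster_ids):
        keep.setdefault(cid, col)        # keep the first column seen in each cluster
    return sorted(keep.values())         # indices of the columns to keep
```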

By the way, I also run MDA to optimize for Sharpe; early numbers show some promise.

Here’s the link: 4.2. Permutation feature importance — scikit-learn 0.24.2 documentation

And I only found out about this helpful article when *Minou posted about it above…
also very generous of him.
