Feature selection by Marcos Lopez de Prado


My experiments show that there is huge potential in reducing the number of features used for training. Gains can be as high as +0.5% CORR on the validation set (or higher if you do better than I do :slight_smile: ).

Marcos Lopez de Prado describes the “Mean Decrease Accuracy” (MDA) algorithm in his book “Advances in Financial Machine Learning”. Here is my code snippet that implements that algorithm.

```python
import numpy as np

def MDA(model, features, testSet):
    # num is my own module; numerai_score returns per-era correlation stats
    testSet['pred'] = model.predict(testSet[features])   # predict with a pre-fitted model on an OOS validation set
    corr, std = num.numerai_score(testSet)   # save base scores
    print("Base corr:", corr)
    diff = []
    np.random.seed(42)
    for col in features:   # iterate through each feature
        X = testSet.copy()
        np.random.shuffle(X[col].values)   # shuffle the selected feature column, preserving its distribution
        testSet['pred'] = model.predict(X[features])   # predict with the same pre-fitted model, with one shuffled feature
        corrX, stdX = num.numerai_score(testSet)   # compare scores...
        print(col, corrX - corr)
        diff.append((col, corrX - corr))
    return diff
```

Simple, fast, elegant and it improves your models!
Have fun!
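For reference, `num.numerai_score` above is the poster's own helper and isn't shown in the thread. A minimal sketch of what such a scorer could look like (per-era Spearman correlation between predictions and target; the function name and column names are assumptions, not the poster's actual code):

```python
import pandas as pd
from scipy.stats import spearmanr

def numerai_score(df, pred_col="pred", target_col="target", era_col="era"):
    """Spearman correlation of predictions vs. target, computed per era;
    returns the mean and standard deviation across eras."""
    era_corrs = df.groupby(era_col).apply(
        lambda d: spearmanr(d[pred_col], d[target_col])[0]
    )
    return era_corrs.mean(), era_corrs.std()
```

Scoring per era and then averaging (rather than scoring the whole set at once) matches how the tournament evaluates CORR.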


I’m confused. It looks like after running this you have a list of:

how much CORR gets better or worse when FeatureName is randomly shuffled.

How do you go about interpreting the results? Do you say something like: “Well, when I randomly shuffle FeatureWisdom13 my CORR on the validation set gets better. Therefore I should exclude it from my final model.”

I think I’m missing something because it seems to me that randomly shuffling an entire column would just make the overall model worse since it’s just adding in random noise.


As I understand it, the question is: what happens if you “remove” one feature?
Shuffling is a kind of “remove”. It doesn’t require re-training, so it’s much faster.
And shuffling keeps the distribution of that feature.
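A quick way to convince yourself of both claims, on synthetic data: shuffling leaves the feature's marginal distribution exactly intact while destroying its relationship with the target.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.integers(0, 5, size=10_000).astype(float)   # a fake categorical-ish feature
y = x + rng.normal(0, 0.1, size=10_000)             # target strongly driven by x

shuffled = x.copy()
rng.shuffle(shuffled)

# The marginal distribution is untouched...
assert np.array_equal(np.sort(x), np.sort(shuffled))
# ...but the feature-target relationship is destroyed.
print(np.corrcoef(x, y)[0, 1])         # close to 1
print(np.corrcoef(shuffled, y)[0, 1])  # close to 0
```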


Maybe you want to do random sampling, not shuffling?

Nope, it’s shuffling!
Reading the book is highly recommended. I’ve learnt a lot from it.

But obviously this is my interpretation and implementation of what’s in the book…


Ok, I think I get why shuffling could actually work here. The idea is that by shuffling you remove any signal from that feature and observe whether the optimization metric suffers. If not, the feature might be irrelevant; if it does suffer, the feature should be important. You could also use np.random.random or any other method, I guess. It reminds me a lot of the `shap` package on PyPI.

My concern with feature engineering in this tournament is that we have no idea what the features are. During a burn, one of the features you’re removing could be the difference between a -20% loss and a -5% loss.


I get this now. Going to be a great idea when the feature list explodes. I was able to just drop one feature and make 310 models and evaluate in less than a day. So that seems better for now. There are some features that really help a lot but almost half the features (150) made corr better when dropped individually. Going to try to make a model dropping the top N worst features. The hurting features don’t lower corr nearly as much as the important features help though. So maybe this is insignificant. But I definitely want to be able to drop or combine features when the expanded set comes out.
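That drop-one-feature-and-retrain loop could be sketched roughly like this (everything here is hypothetical: `make_model`, `loo_feature_scores`, and scoring by plain correlation are stand-ins for the poster's actual setup):

```python
import numpy as np

def loo_feature_scores(make_model, train, val, features, target="target"):
    """Retrain once per feature with that feature dropped; return each
    retrained model's correlation with the target on the validation set."""
    scores = {}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        model = make_model()                      # fresh, untrained model
        model.fit(train[kept], train[target])
        preds = model.predict(val[kept])
        scores[dropped] = np.corrcoef(preds, val[target])[0, 1]
    return scores
```

Unlike MDA, this pays for one full retrain per feature, which is why it only stays practical while the feature list is small.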


Thanks for sharing @nyuton, really interesting!
I read that part of the book and had some doubts:

  • At the end of the exercise, I assume you could end up with several shuffled features (all those where corrX-corr is not negative, i.e. shuffling them didn’t hurt)?
  • At first I thought this could lead to higher feature exposure, but after thinking twice, I guess it shouldn’t, as you are just sort of removing (shuffling) features that do not add to corr. Right?
  • Have you tested how consistent it is across eras?

Btw it’s great to have a community to discuss these kinds of things! It’s easy to end up with doubts when reading the book.

Cross validation looks great.
Forward testing is in progress…

I implemented this method some time ago and confirm that it slightly improves the boosting models. In my case, I use MDA based on clustering variables to account for multicollinearity.


If I’ve read correctly, this is the “permutation feature importance” method. There’s an implementation in sklearn.

from sklearn.inspection import permutation_importance

Would be interesting to compare results, and worth trying as dropping even one or two features can make a marked improvement depending on the model.
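For example, a standalone run of sklearn's permutation importance on synthetic data (not tournament data; the dataset and model here are just placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 5 features, only 2 carry signal.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Same idea as MDA: shuffle each feature on held-out data and measure the
# drop in the model's score (R^2 here, instead of corr).
result = permutation_importance(model, X_val, y_val, n_repeats=5,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

The informative features should show a clearly larger score drop than the noise features.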

I guess you are doing the following:

  1. select best features based on your validation set with MDA
  2. retrain your model only with the good features
  3. eval the performance on the validation (the same that was used in MDA)

I am wondering if this is leaking information from the validation set into the training and thereby the gain in validation CORR is an overestimation?

Hi Jay,

good point, this would be an information leakage.
But I do this process with all the models of my cross validation set.
It’s safer that way.


Thank you for the clarification.
So you average the importance scores across all folds, only selecting features which perform well on all folds?
This will reduce the information leakage but it will not eliminate the leakage completely.
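That cross-fold aggregation could look like this (a hypothetical sketch; `fold_diffs` holds, per fold, the `(corrX - corr)` values returned by the MDA snippet above):

```python
import numpy as np

def select_features(fold_diffs, require_all=True):
    """fold_diffs: list of dicts {feature: corrX - corr}, one per fold.
    A negative diff means shuffling the feature hurt corr, i.e. it matters.
    Keep features whose diff is negative in every fold, or on average."""
    keep = []
    for f in fold_diffs[0]:
        diffs = [d[f] for d in fold_diffs]
        if require_all:
            if all(x < 0 for x in diffs):
                keep.append(f)
        elif np.mean(diffs) < 0:
            keep.append(f)
    return keep
```

Requiring a negative diff in every fold is stricter than averaging, and is closer to "performs well on all folds" as described above.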


How about doing this procedure on the training set? Yes, running predictions with training data, but only to check if the performance drops from the starting one (with no shuffling).

If the model is really using the feature, the performance will drop when shuffling that feature. That way there is no leakage.

Is there something obviously wrong with that thought?

Yes, I average the importance scores across all folds.
How does it have information leakage?

I’m interested in other points of views!
It produces such great results that it seems too good to be true.
But I don’t see how I leak information…

Thanks for pointing it out. I didn’t know about the sklearn implementation.

If you delete features on training CV and only after that check on val, you should be fine. As long as you check the final performance on a sample you have not optimised the feature selection on.

Just to make sure we are talking about the same things. I assume we have a dataset which we split into 5 folds and then we train on 4 folds and validate on 1 fold. We rotate this 5 times to have an entire cross validation.
If you have information leakage from validation into training in each of the 5 folds, then by averaging your importance scores you are diluting the information leakage, but you do not remove it completely.

I get the following results for 3 different experiments:

  1. Use validation data to get feature importance scores with MDA. Select most important features and retrain. Do this for each fold. Result: corr increases by 0.7% (a lot of leakage)
  2. Use validation data to get feature importance scores with MDA. Average the importance score across all 5 folds. Then select most important features and retrain. Do this for each fold. Result: corr increases by 0.5% (leakage is a bit diluted)
  3. Use training data to get feature importance scores with MDA. Average the importance score across all 5 folds. Then select most important features and retrain. Do this for each fold. Result: corr increases by 0.025% (no leakage for sure)

For these reasons I guess there is still leakage even if you average cross folds.

One thing I did not try is to split the training data into 2 sets. Train on the first set and do MDA on the second set. Then you still have your validation set for a validation without any leakage. Would be interesting to see the results of this.
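That three-way split could look roughly like this (a hypothetical sketch; for era-based data it seems safer to split by whole eras than by rows, so no era straddles two sets):

```python
import pandas as pd

def split_by_era(df, era_col="era", frac_train=0.6, frac_select=0.2):
    """Split chronologically by era: fit on `train`, run MDA on `select`,
    and keep `val` untouched for the final, leak-free check."""
    eras = df[era_col].unique()          # assumes eras appear in time order
    n_tr = int(frac_train * len(eras))
    n_sel = int(frac_select * len(eras))
    train = df[df[era_col].isin(eras[:n_tr])]
    select = df[df[era_col].isin(eras[n_tr:n_tr + n_sel])]
    val = df[df[era_col].isin(eras[n_tr + n_sel:])]
    return train, select, val
```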
