ShatteredX's Improved & Compact Feature Set (225 features) for v4.3 Midnight Data

I have made my own feature set that is a subset of the v4.3 feature set. Here is the link:

The goal of this experiment was to create a feature set with the fewest features while also maximizing CORR and MMC. I wanted to create a feature set smaller than the Numerai medium set contained in features.json but also better. The most practical application of this feature set is that it lets users train models with less RAM, which is a common roadblock for new users.

The methods I used to select the features were pretty simple:

  1. Built-in feature importances of lightgbm/xgboost.
  2. SHAP feature importances.
  3. Brute-force evaluation vs. the validation set.
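The post does not say how the three signals were combined. As a purely illustrative sketch, one simple option is to average each feature's rank across the importance methods and keep the top k. The function name `average_rank_selection`, the feature names, and the scores below are all made up for the example and are not ShatteredX's actual procedure.

```python
from collections import defaultdict


def average_rank_selection(importances_by_method, k):
    """Average each feature's rank across methods and keep the k best.

    Assumes every method scores the same set of features.
    """
    rank_sums = defaultdict(float)
    for scores in importances_by_method.values():
        # Rank 0 = most important under this method; ties broken by name.
        ordered = sorted(scores, key=lambda f: (-scores[f], f))
        for rank, feature in enumerate(ordered):
            rank_sums[feature] += rank
    n_methods = len(importances_by_method)
    mean_rank = {f: total / n_methods for f, total in rank_sums.items()}
    # Lowest mean rank first; ties again broken by name for determinism.
    return sorted(mean_rank, key=lambda f: (mean_rank[f], f))[:k]


# Made-up importance scores for four placeholder features.
toy_importances = {
    "gain": {"f_a": 0.9, "f_b": 0.5, "f_c": 0.1, "f_d": 0.3},
    "shap": {"f_a": 0.7, "f_b": 0.2, "f_c": 0.1, "f_d": 0.6},
}
selected = average_rank_selection(toy_importances, k=2)
print(selected)
```

A rank-based combination sidesteps the fact that gain-style importances and SHAP values live on different scales. Any subset chosen this way would still need the brute-force check against validation that the post describes.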

(I will say at the start here: yes, this feature set is “overfit” to the validation set. I did not train on validation but I did evaluate against validation many times. Does this feature set still have value? That is for you to decide.)

Overall, I am pleased with the results. Here are the cumulative CORR20v2 results using the “example model” trained on eras 1-561 (downsampled to every 4th era) on target_cyrus_v4_20:

model = LGBMRegressor(n_estimators=2000, max_depth=5, learning_rate=0.01, colsample_bytree=0.1, num_leaves=2**5+1)

(I just noticed num_leaves should be 2**5-1 instead of 2**5+1 to match the example model, a minor difference).
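For reference, here is a small sketch of the era downsampling and the two num_leaves values. It assumes "every 4th era" means starting at era 1 and stepping by 4; the post does not state the offset.

```python
# Training eras 1-561 as described in the post.
all_train_eras = list(range(1, 562))

# Assumption: keep eras 1, 5, 9, ... (the starting offset is not given in the post).
train_eras = all_train_eras[::4]

# num_leaves as posted vs. the value that matches the example model.
posted_num_leaves = 2**5 + 1   # 33
example_num_leaves = 2**5 - 1  # 31

print(len(train_eras), train_eras[:3], train_eras[-1])
print(posted_num_leaves, example_num_leaves)
```

A binary tree of depth 5 has at most 2**5 = 32 leaves, so with max_depth=5 the two settings differ by at most one leaf per tree, which fits the "minor difference" remark.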

Evaluated vs ALL validation eras 575-1092:

Surprisingly, my compact feature set of 225 features gets higher CORR than the full v4.3 feature set! In fact, all the metrics are better, including sharpe and even feature exposure (barely).

Here are the diagnostics of each feature set:

My feature set (225)

All v4.3 features (2,376)

Medium v4.3 features (705)

Small v4.3 features (42)

Will these features continue to beat the full feature set? Who knows, only time will tell. If you forced me to choose one to stake on, I would probably still choose the full feature set, but it will be interesting to see how they perform going forward. Good luck!


It might be interesting to see how many of each feature_set ShatteredX chose to include.

Some observations

  1. It’s a pretty diverse/balanced set which is healthy
  2. It agrees nicely with our small feature set which we made quite a while ago now (before sunshine/rain/midnight), and was our attempt to make a super super compact dataset
  3. There’s a skew towards more recent releases like midnight and rain, which makes sense because we tried to make those more compact ourselves
  4. Poor charisma and strength :frowning:

Perhaps an obvious extension for the ambitious/compute-rich/perfectionists who aim to avoid the validation overfitting problem ShatteredX mentions:

  • Turn this into a system which always picks the best 225 or so features given any data provided.
  • Walk-forward every 50 or so eras, re-evaluate your favorite 225 features (up to era X, let’s say)
  • Train on that data up to era X, and predict on eras from X+5 (for embargo) to X+50.
  • Now you have an entire dataset of valid out-of-sample predictions
  • Check if those predictions are better than just training on all features in the same walk-forward approach!
  • Try not to iterate on this more than a couple of times or else you’re back in overfitting territory
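The walk-forward schedule above can be sketched as a small generator. This is a minimal illustration, not a full pipeline: the function name `walk_forward_splits` and its parameters are hypothetical, and the starting cutoff of 561 and last era of 1092 are borrowed from the era ranges in the original post only to make the example concrete.

```python
def walk_forward_splits(first_era, last_era, first_cutoff, step=50, embargo=5):
    """Yield (train_eras, test_eras) pairs for a walk-forward evaluation.

    For each cutoff X: select features and train on eras first_era..X,
    then predict eras X+embargo..X+step (clipped to last_era).
    """
    cutoff = first_cutoff
    while cutoff + embargo <= last_era:
        train_eras = range(first_era, cutoff + 1)
        test_eras = range(cutoff + embargo, min(cutoff + step, last_era) + 1)
        yield train_eras, test_eras
        cutoff += step


# Illustrative era numbers only.
splits = list(walk_forward_splits(first_era=1, last_era=1092, first_cutoff=561))
for train_eras, test_eras in splits[:2]:
    print(f"train {train_eras[0]}-{train_eras[-1]}, "
          f"predict {test_eras[0]}-{test_eras[-1]}")
```

Inside each split you would rerun the feature selection using only `train_eras`, fit the model, and store the predictions for `test_eras`. Read literally, the schedule leaves the first four eras after each retrain (X+51 to X+54) unpredicted because of the embargo; shortening the step or extending the test window would close that gap.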

but you can avoid over-fitting by using a different model in the feature engineering step and for validation/prediction?

@halsmith99 Yeah I thought the same. I did use two different models, not the lightgbm one, to perform the brute-force feature evaluation.

@master_key That is really cool to see the proportion it shares with the other feature groups!

I also forgot to mention that my original inspiration for this experiment was @mdo's BorutaShap thread, "Feature Selection with BorutaShap", where he created the small feature set. So I have been thinking about this idea for two years now apparently :joy: :older_man:


Why would using two different models avoid over-fitting?

If you give any model only the features that work for the majority of the validation period then it’s going to look great on validation guaranteed


Yeah still massively overfit to the validation set. I guess the idea would be that at least it might not be overfit to specific hyperparameters or tree library.

the idea i had was if i apply the feature set to a different model that comes up with a significantly different feature exposure distribution

then it is learning different patterns from the data set and may be less overfit than using the feature engineering model out of sample

ran it on the 20 day targets using a cat, lgb & xgb model hypertuned on an older dataset/target (v3/4 & nomi) and there is improvement across the board.

but i haven’t checked the feature exposures.
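For anyone wanting to run that check, here is a minimal sketch of one common way to measure feature exposure: the correlation between a model's predictions and each feature column, with the largest absolute value reported as the max exposure. The helper names and the toy data are made up; in practice this is usually computed per era and then averaged, which the sketch leaves out.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def feature_exposures(predictions, features):
    """Correlation of the predictions with each feature column."""
    return {name: pearson(predictions, col) for name, col in features.items()}


def max_feature_exposure(predictions, features):
    """Largest absolute exposure across all features."""
    return max(abs(c) for c in feature_exposures(predictions, features).values())


# Toy data for a single era (hypothetical values).
predictions = [0.1, 0.2, 0.3, 0.4, 0.5]
features = {
    "f_up": [0, 1, 2, 3, 4],     # moves in lockstep with the predictions
    "f_mixed": [2, 0, 4, 0, 2],  # no linear relationship with the predictions
}
exposures = feature_exposures(predictions, features)
print(exposures, max_feature_exposure(predictions, features))
```

Computing `feature_exposures` for the feature-selection model and for the second model, then comparing the two dictionaries, would show whether they really lean on different features.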

I’ve been experimenting with the v4.3 data set and it seems that the last 100 or so new features added with Midnight don’t amount to much, at least when run on the validation data with a genetic algorithm using a soft limit of around 300 features used in any given model.

This is a sample plot of the feature utility function through 4K iterations (a complete evaluation takes around 20-30K):

the lower plot is done with a 51 element centered moving median over the data in the upper plot.

Do the new features have a lot of NaNs (or 2s) in the final columns which would cause that sort of problem? I don’t know yet.
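For readers who want to reproduce the smoothing used for the lower plot, here is a minimal sketch of a centered moving median. The post does not say how the edges were handled, so the shrinking window at the start and end of the series is an assumption, and the function name is made up.

```python
from statistics import median


def centered_moving_median(values, window=51):
    """Median of the `window` points centered on each index.

    Assumption: near the edges the window shrinks to the points available.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Tiny example with a 3-element window; the spikes at 100 and 50 are suppressed
# away from the edges.
noisy = [1, 100, 2, 3, 50, 4, 5]
smoothed = centered_moving_median(noisy, window=3)
print(smoothed)
```

A median filter is a reasonable choice for a noisy utility curve because isolated spikes do not drag the smoothed line the way they would with a moving average.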

I used some of my feature selection ideas and adjusted the subset slightly (removed 29 + added 50 Features) and trained a lightgbm model (using the above params):

If you are interested in the ideas behind it - you should attend the Numerai Meetup in Frankfurt :upside_down_face:. I will be giving a talk there on how I do feature selection to train my Neural Nets :v:.


Thank you for sharing this subset @shatteredx. I will use it !
I plan to refresh some of my very old models (and have some more fun).

Another trick, if you don't have much RAM, may be to use CatBoost. It seems to handle int8 directly where other libraries convert to float32. I’m not 100% sure, but it’s mentioned here and there (for example, in this article: Machine Learning Tricks to Optimize CatBoost Performance Up to 4x)
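To put rough numbers on the RAM point, here is a back-of-the-envelope sketch of the dense size of the feature matrix for each feature set at 1 byte (int8) versus 4 bytes (float32) per value. The feature counts come from this thread; the row count of 1,000,000 is a hypothetical placeholder, so plug in the number of rows you actually load. It says nothing about what any particular library does internally.

```python
BYTES_PER_VALUE = {"int8": 1, "float32": 4}

# Feature counts quoted in this thread.
FEATURE_COUNTS = {"small": 42, "compact": 225, "medium": 705, "all": 2376}

N_ROWS = 1_000_000  # hypothetical row count, not the real dataset size


def dense_size_gb(n_rows, n_features, dtype):
    """Size of a dense n_rows x n_features matrix in decimal gigabytes."""
    return n_rows * n_features * BYTES_PER_VALUE[dtype] / 1e9


for name, n_features in FEATURE_COUNTS.items():
    as_int8 = dense_size_gb(N_ROWS, n_features, "int8")
    as_float32 = dense_size_gb(N_ROWS, n_features, "float32")
    print(f"{name:>8} ({n_features:>4} features): "
          f"{as_int8:.3f} GB int8, {as_float32:.3f} GB float32")
```

Both levers multiply: cutting from 2,376 to 225 features shrinks the matrix by roughly 10.5x, and staying in int8 instead of float32 saves another 4x.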

early days but initial live results look encouraging.

lazer_02 using all features, lazer_u0013/14 using the reduced shatteredx feature set.

all are xgb ensembles on the same parameters using up to a dozen 20d targets.


The fact that you can outperform a model with 2300+ features with a model that uses only a subset of 225 of those features can mean two things imho:

  • most features contain no or redundant information
  • the model has real trouble extracting the information contained in the full feature set.

@shatteredx, what’s your opinion on this?
