This post is about feature exposure. I’ll try to explain the intuition behind feature exposure and why it matters. I’ll also discuss two ways to reduce it: regularization and feature neutralization.
Feature Exposure
The idea behind feature exposure is as follows. From a very high-level perspective, any supervised ML model is a function that takes an input feature vector (X) and outputs a prediction (y). At training time, the model learns a mapping from input features to predictions. With the Numerai data, the underlying process is non-stationary, i.e. features that have great predictive power in one era might have none in another era, or might even hurt the model’s performance. A model that attributes too much importance to a small set of features might do well in the short run, but is unlikely to perform well in the long run. Feature exposure measures how strongly a model’s predictions are correlated with the individual features, and max feature exposure (the largest absolute correlation with any single feature) indicates how well balanced that exposure is. Models with lower feature exposures tend to have more consistent performance over the long run.
For a real life example of this, I refer you to the massive burn in r223 on my primary account. The model that I’d used for that round was performing rather well on live data under another one of my accounts, before I decided to flip it over to my primary account. In hindsight that model was “overfit” on a limited set of features and when the regime changed, it began burning heavily. To conclude the anecdote, I switched back to a more conservative model from the next round onwards and everything was fine (at least for the next round).
Bear in mind that it’s possible to train models with extremely low max feature exposure, which aren’t very useful in practice. There’s a trade-off between feature exposure and correlation. Models with very low max feature exposure also tend to have low correlation. On the other hand, models with high max feature exposure will likely have higher corr, but are also more likely to burn in the long run.
The feature exposure metric has changed a bit since I last posted an implementation of it. We’ve gone from using the Pearson correlation coefficient to using Spearman’s rank correlation coefficient (the same metric used for CORR). And instead of aggregating individual feature exposures with standard deviation, we’re now using root mean square as the aggregation function. Let’s start with a Python snippet that calculates max feature exposure the new way. I know there are a lot of people here who use R. I’d appreciate it if anyone proficient in R could post an R version of the snippet below in this thread.
import numpy as np
from scipy.stats import spearmanr

TOURNAMENT_NAME = "kazutsugi"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"


def feature_exposures(df):
    # Spearman correlation between the predictions and each feature column
    feature_names = [f for f in df.columns
                     if f.startswith("feature")]
    exposures = []
    for f in feature_names:
        fe = spearmanr(df[PREDICTION_NAME], df[f])[0]
        exposures.append(fe)
    return np.array(exposures)


def max_feature_exposure(df):
    # largest absolute exposure to any single feature
    return np.max(np.abs(feature_exposures(df)))


def feature_exposure(df):
    # root mean square of the individual exposures
    return np.sqrt(np.mean(np.square(feature_exposures(df))))
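To get a feel for what the metric measures, here is a small toy comparison. It is only a sketch: the data is random, the two "models" are made up for illustration, and I'm using pandas' `corrwith` as a compact stand-in for the `spearmanr` loop above. One prediction leans on a single feature, the other spreads its weight evenly over all of them.

```python
import numpy as np
import pandas as pd

PREDICTION_NAME = "prediction_kazutsugi"


def feature_exposures(df, prediction_col=PREDICTION_NAME):
    """Spearman correlation of the prediction with every feature column."""
    feature_names = [c for c in df.columns if c.startswith("feature")]
    return df[feature_names].corrwith(df[prediction_col], method="spearman")


def max_feature_exposure(df, prediction_col=PREDICTION_NAME):
    """Largest absolute exposure to any single feature."""
    return feature_exposures(df, prediction_col).abs().max()


# Toy data (an assumption for illustration): 10 features quantized to 5 levels.
rng = np.random.default_rng(0)
n_rows, n_features = 5000, 10
features = pd.DataFrame(
    rng.integers(0, 5, size=(n_rows, n_features)) / 4.0,
    columns=[f"feature_{i}" for i in range(n_features)],
)
noise = rng.normal(scale=0.1, size=n_rows)

# A "model" that relies on one feature versus one that uses all of them equally.
concentrated = features.assign(**{PREDICTION_NAME: features["feature_0"] + noise})
balanced = features.assign(**{PREDICTION_NAME: features.mean(axis=1) + noise})

print("concentrated:", round(max_feature_exposure(concentrated), 4))
print("balanced:    ", round(max_feature_exposure(balanced), 4))
```

The concentrated prediction should show a max feature exposure close to 1, while the balanced one stays far lower, even though both are purely linear in the features.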
Given the aforementioned changes in the feature exposure metrics, all previous heuristics we had about good feature exposures are no longer valid. The example model has a validation max feature exposure of 0.2905. That’s a reasonable benchmark to strive for, IMO, although it’s not difficult to do better than that (as we shall see in the section on feature neutralization below).
Now let’s look at two models which have very similar in-sample (training) sharpe, but slightly different training max feature exposures. NeuralNet8 and NeuralNet19 are two NN models with very similar in-sample correlations (0.0407) and sharpe (1.09), but slightly different in-sample max feature exposures (0.257 for NeuralNet8 and 0.325 for NeuralNet19). Let’s see how this difference affects their out of sample (validation) scores.
The model with the lower in-sample max feature exposure (NeuralNet8) seems to do better on out of sample corr and sharpe. You might also notice that the worse model (NeuralNet19) paradoxically seems to have lower out of sample max feature exposure. It’s always a good idea to look at both in-sample and out of sample max feature exposures while evaluating models.
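For anyone unsure how the corr and sharpe numbers are put together, here is a minimal sketch. It assumes the usual convention as I understand it: corr is the Spearman correlation between predictions and targets computed within each era, and sharpe is the mean of those per-era correlations divided by their standard deviation. The column names `era`, `target` and `prediction` and the toy data are placeholders, not the real tournament files.

```python
import numpy as np
import pandas as pd


def era_scores(df, prediction_col="prediction", target_col="target", era_col="era"):
    """Spearman correlation between prediction and target, one value per era."""
    return pd.Series({
        era: group[prediction_col].corr(group[target_col], method="spearman")
        for era, group in df.groupby(era_col)
    })


def sharpe(scores):
    """Mean per-era score divided by its standard deviation."""
    return scores.mean() / scores.std()


# Toy data (made up for illustration): 20 eras of 500 rows each.
rng = np.random.default_rng(0)
n_eras, rows_per_era = 20, 500
toy = pd.DataFrame({
    "era": np.repeat([f"era{i}" for i in range(1, n_eras + 1)], rows_per_era),
    "target": rng.integers(0, 5, size=n_eras * rows_per_era) / 4.0,
})
# A weak signal buried in noise, so the per-era correlations vary.
toy["prediction"] = 0.3 * toy["target"] + rng.normal(size=len(toy))

scores = era_scores(toy)
print(f"mean corr: {scores.mean():.4f}, sharpe: {sharpe(scores):.4f}")
```

Two models can share the same mean corr and still differ in sharpe if one of them swings more from era to era, which is why it helps to look at both.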
This inverse correlation between max feature exposure and out of sample performance seems to generally hold true for all kinds of models. To illustrate the point, here are two regression plots comparing out of sample (validation) and in-sample (training) max feature exposures with out of sample sharpe. This is drawn from 80 different Gradient Boosted Tree and Neural Network models (provided by the Numerai team). There’s also a linear model and the example model thrown into the mix. The highest point (i.e the best performing model) in both plots unsurprisingly is the example model.
Reducing Feature Exposure with Regularization
Let’s try training the example model with L1 regularization and see if it has any effect on the model’s feature exposure. If you’re following along at home, you’ll need to edit the line where the XGBRegressor instance is created to add an extra parameter, alpha. I’m setting it to 0.1.
The specific line to change will go from this:
model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1)
To this:
model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1, alpha=0.1)
Let’s look at the validation results for the example model trained without the extra parameter.
And now for the model trained with L1 regularization.
As you can see, the model is mostly the same: the validation correlation is down by a bit and so is the validation sharpe, but the max feature exposure is also slightly lower. I haven’t tried to search for the optimal value of the hyperparameter alpha here. Searching for it will almost certainly lead to better results.
Also, there are many more regularization parameters that are worth exploring for XGBoost alone. And if you’re training NNs, there’s a plethora of regularization parameters worth exploring.
Feature Neutralization
Yet another, stronger way to reduce feature exposures is to use feature neutralization.
Here’s a slightly simplified version of the neutralization code from the official analysis and tips notebook.
def neutralize(df, target="prediction_kazutsugi", by=None, proportion=1.0):
    if by is None:
        by = [x for x in df.columns if x.startswith('feature')]

    scores = df[target]
    exposures = df[by].values

    # constant column to make sure the series is completely neutral to exposures
    exposures = np.hstack((exposures, np.array([np.mean(scores)] * len(exposures)).reshape(-1, 1)))

    scores -= proportion * (exposures @ (np.linalg.pinv(exposures) @ scores.values))

    return scores / scores.std()
There’s quite a lot going on in the little snippet of code. Let me try to explain the important bits. The function takes a pandas DataFrame with features and predictions and returns a pandas Series with neutralized predictions.
- On line 9 (the np.hstack call), we take the feature matrix from the DataFrame and concatenate another column to it, which has a constant value (the mean of the prediction column). This is to remove bias from the linear model on the next line.
- On line 11 (the scores -= ... statement), we compute the pseudo-inverse of the feature matrix from the previous step and multiply this pseudo-inverse with the predictions. This returns the coefficients of an OLS model fitted on the features.
- On the same line, we then multiply the features with the coefficients, which returns the predictions of the linear model we just fitted.
- We then multiply these linear predictions with a constant proportion (between 0 and 1) and subtract them from the original predictions.
- Subtracting the linear predictions (of the original predictions) from the original predictions results in predictions that are less linear with respect to the features (with no linear component left at all if the proportion is set to 1).
- Finally we divide the output by its standard deviation to rescale it and return it.
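The steps above can be checked on toy data. This is a sketch under my own assumptions: the data is random, the feature names are placeholders, and the `neutralize` here is a lightly restructured variant that works on a copy so the input DataFrame is left untouched. With proportion set to 1, the residual should have essentially zero Pearson correlation with every feature, while the Spearman-based exposure is small but not exactly zero.

```python
import numpy as np
import pandas as pd

TARGET = "prediction_kazutsugi"


def neutralize(df, target=TARGET, by=None, proportion=1.0):
    """Remove a proportion of the OLS fit of `target` on the `by` columns."""
    if by is None:
        by = [c for c in df.columns if c.startswith("feature")]
    scores = df[target].copy()  # copy so df itself is not modified
    exposures = df[by].values
    # Constant column (mean of the predictions) acting as the intercept term.
    constant = np.full((len(exposures), 1), scores.mean())
    exposures = np.hstack((exposures, constant))
    # OLS coefficients via the pseudo-inverse, then the fitted linear predictions.
    linear_fit = exposures @ (np.linalg.pinv(exposures) @ scores.values)
    residual = scores - proportion * linear_fit
    return residual / residual.std()


# Toy data (an assumption for illustration): a linear part plus an interaction.
rng = np.random.default_rng(0)
n_rows = 2000
feature_names = [f"feature_{i}" for i in range(10)]
df = pd.DataFrame(rng.integers(0, 5, size=(n_rows, 10)) / 4.0, columns=feature_names)
df[TARGET] = (
    0.5
    + 0.3 * (df["feature_0"] - 0.5)
    + 0.8 * (df["feature_1"] - 0.5) * (df["feature_2"] - 0.5)
    + rng.normal(scale=0.05, size=n_rows)
)

neutral = neutralize(df)
features = df[feature_names]

pearson_after = features.corrwith(neutral)
spearman_before = features.corrwith(df[TARGET], method="spearman")
spearman_after = features.corrwith(neutral, method="spearman")

print("max |pearson| after: ", pearson_after.abs().max())
print("max |spearman| before:", spearman_before.abs().max())
print("max |spearman| after: ", spearman_after.abs().max())
```

The interaction term between feature_1 and feature_2 survives neutralization, because it is not something a linear model on the features can express.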
If you’ve read this far, you’ve probably realized that feature neutralization is somehow related to feature exposures. And you’re right! Neutralizing the predictions with respect to the features reduces both feature exposure and max feature exposure. But they’re not exactly the same (@mdo has a great post explaining the difference). Let’s take the validation predictions from our old trusted example model, apply feature neutralization to them, and see what happens. Sidenote: You might want to open this post in a second browser window and scroll one of them to the graphs from the unmodified example model above, to compare and contrast.
As you can see, the feature exposure and max feature exposure values have dropped dramatically (fe from 0.0850 to 0.0061 and max fe from 0.2955 to 0.0153). The validation correlation has dropped a bit (from 0.0291 to 0.0255) but the validation sharpe has gone up (from 0.9608 to 1.2436). The two burn eras in the un-neutralized model, era205 and era206, have flipped and now have reasonable correlations. In light of the improved sharpe ratio, it’s safe to conclude that neutralizing the predictions has made the model more consistent over the eras. Perhaps it’s also worthwhile trying to fine-tune the proportion parameter. Another thing worth experimenting with is neutralizing predictions with respect to a subset of the feature groups instead of all the features. If you’d like to try this with your own models, the code to neutralize predictions is a one-liner.
df["prediction_kazutsugi"] = neutralize(df)
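Here is a rough sketch of the two experiments suggested above: sweeping the proportion parameter, and neutralizing against only one feature group through the `by` argument. Everything in it is illustrative. The data is random, the group names in the column prefixes are placeholders, and the `neutralize` is a copy-based variant of the function shown earlier.

```python
import numpy as np
import pandas as pd

TARGET = "prediction_kazutsugi"


def neutralize(df, target=TARGET, by=None, proportion=1.0):
    """Remove a proportion of the OLS fit of `target` on the `by` columns."""
    if by is None:
        by = [c for c in df.columns if c.startswith("feature")]
    scores = df[target].copy()  # copy so df itself is not modified
    exposures = df[by].values
    constant = np.full((len(exposures), 1), scores.mean())
    exposures = np.hstack((exposures, constant))
    linear_fit = exposures @ (np.linalg.pinv(exposures) @ scores.values)
    residual = scores - proportion * linear_fit
    return residual / residual.std()


def max_exposure(features, scores):
    """Largest absolute Spearman correlation between scores and any feature."""
    return features.corrwith(scores, method="spearman").abs().max()


# Toy data (an assumption): two hypothetical feature groups of three columns each.
rng = np.random.default_rng(0)
n_rows = 5000
groups = {"intelligence": 3, "charisma": 3}
feature_cols = [f"feature_{g}{i}" for g, n in groups.items() for i in range(1, n + 1)]
df = pd.DataFrame(
    rng.integers(0, 5, size=(n_rows, len(feature_cols))) / 4.0,
    columns=feature_cols,
)
df[TARGET] = (
    0.5
    + 0.4 * (df["feature_intelligence1"] - 0.5)
    + 0.2 * (df["feature_charisma1"] - 0.5)
    + rng.normal(scale=0.05, size=n_rows)
)

# Experiment 1: how max feature exposure responds to the proportion parameter.
sweep = {
    p: max_exposure(df[feature_cols], neutralize(df, proportion=p))
    for p in (0.0, 0.25, 0.5, 0.75, 1.0)
}
for p, exposure in sweep.items():
    print(f"proportion={p:.2f}  max feature exposure={exposure:.4f}")

# Experiment 2: neutralize against one group only.
intelligence_cols = [c for c in feature_cols if c.startswith("feature_intelligence")]
charisma_cols = [c for c in feature_cols if c.startswith("feature_charisma")]
partial = neutralize(df, by=intelligence_cols)
intelligence_left = df[intelligence_cols].corrwith(partial).abs().max()
charisma_left = df[charisma_cols].corrwith(partial).abs().max()
print("pearson exposure left, intelligence:", intelligence_left)
print("pearson exposure left, charisma:    ", charisma_left)
```

On real predictions you’d want to track validation corr and sharpe alongside the exposure for each setting, since lower exposure on its own says nothing about whether the model got better.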
Now, what would happen if we feature neutralize a linear model? Intuitively, subtracting linear predictions from a linear model should lead to a very bad model. Let’s try doing that and see what happens.
Firstly, we need to train a linear model. And the easiest way to do that IMO, would be to swap out the default tree based booster in the example model with a linear booster. It’s a really tiny change to the example model.
model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1, booster="gblinear")
Unsurprisingly, the linear model is worse than the example model in every possible way. It’s performing a bit better than I’d expected on val1 and much worse on val2. But can we make it worse?
Sure we can!
Now that’s what I’d call a truly bad model. I’ve got two takeaways from this little experiment.
- Linear models are mediocre performers on average, but do surprisingly well on some eras.
- Neutralizing linear models makes them worse.
Feature exposure and feature neutralization are fairly complex topics which I don’t fully understand, yet. Writing this post has certainly clarified these concepts to a great degree in my mind. I’m quite certain that I’ve left out some important aspects of both in this post, please feel free to post any questions you have on this thread and I’ll try to answer them. And if I cannot, I’m sure someone from the team will. The feature neutralization meme was stolen from @Budbot’s post on #memes. Finally, I’d like to thank @master_key for all the ideas, encouragement and feedback while I was drafting this post. All errors remain mine.
Also, the code for drawing the (not so) pretty bar charts with validation corr and feature exposure is up on this gist.