Model Diagnostics: Feature Exposure

This post is about feature exposure. I’ll try to explain the intuition behind feature exposure and why it matters. I’ll also discuss ways to reduce feature exposure (regularization and feature neutralization).

Feature Exposure

The idea behind feature exposure is as follows: from a very high-level perspective, any supervised ML model is a function that takes an input feature vector (X) and outputs a prediction (y). At training time, the model learns a mapping between the input features and the predictions. With the Numerai data, the underlying process is non-stationary, i.e. features that have great predictive power in one era might have no predictive power, or might even hurt the model’s performance, in another era. A model that attributes too much importance to a small set of features might do well in the short run, but is unlikely to perform well in the long run. Feature exposure (more specifically, max feature exposure) measures how well balanced a model’s exposure to the features is. Models with lower feature exposure tend to have more consistent performance over the long run.

For a real-life example of this, I refer you to the massive burn in r223 on my primary account. The model I’d used for that round was performing rather well on live data under another one of my accounts, before I decided to flip it over to my primary account. In hindsight, that model was “overfit” to a limited set of features, and when the regime changed, it began burning heavily. To conclude the anecdote, I switched back to a more conservative model from the next round onwards and everything was fine (at least for the next round). Bear in mind that it’s possible to train models with extremely low max feature exposure which aren’t very useful in practice. There’s a trade-off between feature exposure and correlation: models with very low max feature exposure also tend to have low correlation. On the other hand, models with high max feature exposure will likely have higher corr, but are also more likely to burn in the long run.

The feature exposure metric has changed a bit since I last posted an implementation of it. We’ve gone from using the Pearson correlation coefficient to Spearman’s rank correlation coefficient (the same metric used for CORR). And instead of aggregating individual feature exposures with the standard deviation, we’re now using the root mean square as the aggregation function. Let’s start with a code snippet in Python to calculate maximum feature exposure the new way. I know there are a lot of people here who use R, so I’d appreciate it if anyone proficient in R could post an R version of the snippet below in this thread.
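
To restate the new metric compactly: if \rho_i is the Spearman rank correlation between the predictions and feature i, then max feature exposure = \max_i |\rho_i| and feature exposure = \sqrt{ (1/n) \sum_i \rho_i^2 } (the root mean square over the n features). That’s exactly what the snippet below computes.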

import numpy as np
from scipy.stats import spearmanr

TOURNAMENT_NAME = "kazutsugi"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"


def feature_exposures(df):
    # Spearman rank correlation between the predictions and each feature
    feature_names = [f for f in df.columns
                     if f.startswith("feature")]
    exposures = []
    for f in feature_names:
        fe = spearmanr(df[PREDICTION_NAME], df[f])[0]
        exposures.append(fe)
    return np.array(exposures)


def max_feature_exposure(df):
    # largest absolute exposure to any single feature
    return np.max(np.abs(feature_exposures(df)))


def feature_exposure(df):
    # root mean square of the individual feature exposures
    return np.sqrt(np.mean(np.square(feature_exposures(df))))
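
As a quick usage example, here’s a minimal sketch (assuming you’ve loaded the tournament data into df, with its usual era and data_type columns, and written your model’s predictions into the PREDICTION_NAME column):

# max feature exposure and feature exposure on the validation slice
validation = df[df["data_type"] == "validation"]
print(f"validation max feature exposure: {max_feature_exposure(validation):.4f}")
print(f"validation feature exposure: {feature_exposure(validation):.4f}")

# the same metric computed per era and then averaged
per_era_max_fe = validation.groupby("era").apply(max_feature_exposure)
print(f"mean per-era max feature exposure: {per_era_max_fe.mean():.4f}")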

Given the aforementioned changes in the feature exposure metric, all previous heuristics we had about good feature exposure values are no longer valid. The example model has a validation max feature exposure of 0.2905. That’s a reasonable benchmark to strive for, IMO. Although it’s not difficult to do better than that (as we shall see in the section on feature neutralization below). :slight_smile:

Now let’s look at two models which have very similar in-sample (training) Sharpe, but slightly different training max feature exposures. NeuralNet8 and NeuralNet19 are two NN models with very similar in-sample (training) correlations (0.0407) and Sharpe (1.09), but slightly different in-sample max feature exposures (0.257 and 0.325 for NeuralNet8 and NeuralNet19, respectively). Let’s see how this difference affects their out-of-sample (validation) scores.

The model with the lower in-sample max feature exposure (NeuralNet8) seems to do better on out-of-sample corr and Sharpe. You might also notice that the worse model (NeuralNet19) paradoxically seems to have lower out-of-sample max feature exposure. It’s always a good idea to look at both in-sample and out-of-sample max feature exposures while evaluating models.

This inverse correlation between max feature exposure and out-of-sample performance seems to generally hold true for all kinds of models. To illustrate the point, here are two regression plots comparing out-of-sample (validation) and in-sample (training) max feature exposures with out-of-sample Sharpe. This is drawn from 80 different Gradient Boosted Tree and Neural Network models (provided by the Numerai team). There’s also a linear model and the example model thrown into the mix. The highest point (i.e. the best-performing model) in both plots is, unsurprisingly, the example model.

Reducing Feature Exposure with Regularization

Let’s try training the example model with L1 regularization and see if it has any effect on the model’s feature exposure. If you’re following along at home, you’ll need to edit the line where the XGBRegressor instance is created to add an extra parameter, alpha. I’m setting it to 0.1.

The specific line to change will go from this:

model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1)

To this:

model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1, alpha=0.1)

Let’s look at the validation results for the example model trained without the extra parameter.

And now for the model trained with L1 regularization.

As you can see, the model is mostly the same: the validation correlation is down a bit and so is the validation Sharpe, but the max feature exposure is also slightly lower. I haven’t tried to search for the optimal value of the hyperparameter alpha here; doing so will almost certainly lead to better results.
There are also many more regularization parameters worth exploring for XGBoost alone. And if you’re training NNs, there’s a whole plethora of regularization options worth exploring.
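
For instance, a hedged sketch of what that might look like for XGBoost (the extra parameter values below are illustrative and untuned, purely to show where the knobs are):

# alpha = L1, reg_lambda = L2, gamma = minimum split loss, subsample = row subsampling
model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1,
                     colsample_bytree=0.1, alpha=0.1, reg_lambda=1.0, gamma=0.01,
                     subsample=0.8)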

Feature Neutralization

Yet another, stronger way to reduce feature exposures is to use feature neutralization.
Here’s a slightly simplified version of the neutralization code from the official analysis and tips notebook.

def neutralize(df, target="prediction_kazutsugi", by=None, proportion=1.0):
    if by is None:
        by = [x for x in df.columns if x.startswith('feature')]

    scores = df[target]
    exposures = df[by].values

    # constant column to make sure the series is completely neutral to exposures
    exposures = np.hstack((exposures, np.array([np.mean(scores)] * len(exposures)).reshape(-1, 1)))

    scores -= proportion * (exposures @ (np.linalg.pinv(exposures) @ scores.values))
    return scores / scores.std()

There’s quite a lot going on in this little snippet of code. Let me try to explain the important bits. The function takes a pandas DataFrame with features and predictions, and returns a pandas Series with neutralized predictions.

  • On line 9, we take the matrix of features from the DataFrame and concatenate an extra column to it with a constant value (the mean of the prediction column). This accounts for the bias (intercept) term of the linear model on the next line.
  • On line 11, we compute the pseudo-inverse of the feature matrix from the previous line and multiply this pseudo-inverse with the predictions. This gives us the coefficients of an OLS model fitted on the features (see the short check after this list).
  • On the same line, we then multiply the features with those coefficients, which gives us the predictions of the linear model we just fitted.
  • We then multiply these linear predictions by a constant proportion (between 0 and 1) and subtract them from the original predictions.
  • Subtracting the linear predictions (of the original predictions) from the original predictions results in predictions that are less linear (fully non-linear if the proportion is set to 1) with respect to the features.
  • Finally, we divide the output by its standard deviation to rescale it and return it.
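
To make the second and third bullets concrete, here’s a tiny check (using the same exposures matrix and scores series as inside the function) that the pinv expression is just an OLS fit followed by its fitted values:

coefs = np.linalg.pinv(exposures) @ scores.values   # OLS coefficients
linear_preds = exposures @ coefs                    # fitted values of that linear model

# np.linalg.lstsq solves the same least-squares problem directly
coefs_lstsq, *_ = np.linalg.lstsq(exposures, scores.values, rcond=None)
assert np.allclose(coefs, coefs_lstsq, atol=1e-6)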

If you’ve read this far, you’ve probably realized that feature neutralization is somehow related to feature exposure. And you’re right! Neutralizing the predictions with respect to the features reduces both feature exposure and max feature exposure. But they’re not exactly the same thing (@mdo has a great post explaining the difference). Let’s take the validation predictions from our old trusted example model, apply feature neutralization to them, and see what happens. Sidenote: you might want to open this post in a second browser window and scroll one of them to the graphs from the unmodified example model above, to compare and contrast.

As you can see, feature exposure and max feature exposure have dropped dramatically (fe from 0.0850 to 0.0061 and max fe from 0.2955 to 0.0153). The validation correlation has dropped a bit (from 0.0291 to 0.0255) but the validation Sharpe has gone up (from 0.9608 to 1.2436). The two burn eras in the un-neutralized model, era205 and era206, have flipped and now have reasonable correlations. In light of the improved Sharpe ratio, it’s safe to conclude that neutralizing the predictions has made the model more consistent over the eras. Perhaps it’s also worthwhile trying to fine-tune the proportion parameter. Another thing worth experimenting with is neutralizing predictions with respect to a subset of the feature groups instead of all the features. If you’d like to try this with your own models, the code to neutralize predictions is a one-liner.

df["prediction_kazutsugi"] = neutralize(df)

Now, what would happen if we feature neutralize a linear model? Intuitively, subtracting linear predictions from a linear model should lead to a very bad model. Let’s try doing that and see what happens.

First, we need to train a linear model, and the easiest way to do that, IMO, is to swap out the default tree-based booster in the example model for a linear booster. It’s a really tiny change to the example model.

model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1, booster="gblinear")

Unsurprisingly, the linear model is worse than the example model in every possible way. It’s performing a bit better than I’d expected on val1 and much worse on val2. But can we make it worse?

Sure we can!

Now that’s what I’d call a truly bad model. I’ve got two takeaways from this little experiment.

  1. Linear models are mediocre performers on average, but do surprisingly well on some eras.
  2. Neutralizing linear models makes them worse.

Feature exposure and feature neutralization are fairly complex topics which I don’t fully understand yet. Writing this post has certainly clarified these concepts to a great degree in my mind. I’m quite certain that I’ve left out some important aspects of both in this post; please feel free to post any questions you have in this thread and I’ll try to answer them. And if I can’t, I’m sure someone from the team will. The feature neutralization meme was stolen from @Budbot’s post on #memes. Finally, I’d like to thank @master_key for all the ideas, encouragement and feedback while I was drafting this post. All errors remain mine.

Also, the code for drawing the (not so) pretty bar charts with validation corr and feature exposure is up on this gist.

46 Likes

Great post. Happy to have my meme stolen :smile:. Never thought of neutralising to select features :thinking:

2 Likes

Perhaps it’s worth looking into neutralizing against only the top-k highest exposed features. :slight_smile:
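
Something along these lines might work, reusing the feature_exposures() and neutralize() functions from the post above (an untested sketch):

def neutralize_top_k(df, k=10, proportion=1.0):
    # rank the features by the absolute exposure of the current predictions to them
    feature_names = [f for f in df.columns if f.startswith("feature")]
    abs_exposures = np.abs(feature_exposures(df))
    top_k = [feature_names[i] for i in np.argsort(-abs_exposures)[:k]]
    # neutralize only against those k features
    return neutralize(df, target=PREDICTION_NAME, by=top_k, proportion=proportion)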

4 Likes

Great post @jrb. I tried your neutralization function in one of my models and got lower exposure and an SR boost on val. However, when trying to neutralize the test set I ran into a memory issue. Any idea how to prevent that?

I’m glad you liked it, @jeremy_berros. Inverting large matrices is very expensive (both in terms of CPU and memory), and if you’re doing this for the whole tournament data, you’re dealing with an extremely large matrix: ~1.6 million rows × 311 columns (310 features + the mean column).

Unfortunately, there’s no easy way to get around it. The bottleneck is the call to numpy.linalg.pinv. There’s scipy.linalg.pinv, which is a drop-in replacement for the numpy function that uses a linear least-squares solver, but that’ll be even more memory-hungry. There’s also scipy.linalg.pinv2, which has similar performance characteristics to numpy.linalg.pinv (both use SVD to compute the pseudo-inverse), but the performance difference between these two functions is negligible to non-existent.

I’d recommend using the Python del statement to delete as many unused data structures as possible to free up memory before calling the neutralize() function. It might also be worth calling gc.collect() to reclaim whatever little memory it can.

Another option would be to compute the coefficients on a smaller set (perhaps the validation set, as you’ve already mentioned) and then use those coefficients on the larger set. But that wouldn’t be the same as feature neutralizing against the whole dataset. If you’d like to try this out, you’ll need to compute the (np.linalg.pinv(exposures) @ scores.values) part from the snippet on the smaller set, keep the 311-dimensional vector of coefficients it returns, and use it to neutralize the bigger set; see the rough sketch below. Let me know if any of these options work for you.
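
A rough sketch of that last option (fitting the coefficients on the validation slice and applying them to the full tournament data; this assumes the usual data_type column and mirrors the constant column used in neutralize() above):

feature_names = [c for c in df.columns if c.startswith("feature")]

# fit the linear "neutralizer" on the (much smaller) validation slice
val = df[df["data_type"] == "validation"]
const = val[PREDICTION_NAME].mean()
val_exposures = np.hstack([val[feature_names].values,
                           np.full((len(val), 1), const)])
coefs = np.linalg.pinv(val_exposures) @ val[PREDICTION_NAME].values  # the 311-dimensional vector

# apply those coefficients to the whole tournament data
full_exposures = np.hstack([df[feature_names].values,
                            np.full((len(df), 1), const)])
neutralized = df[PREDICTION_NAME].values - full_exposures @ coefs
df[PREDICTION_NAME] = neutralized / neutralized.std()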

2 Likes

Thanks @jrb for your quick reply. I already tried del / gc.collect(), but my 16 GB of RAM still isn’t enough. I did some feature selection based on minimum correlation, which gives me a feature exposure on the val set of around 0.06. Neutralizing with a proportion of 0.5 brings val feature exposure down to ~0.02 and boosts val SR to ~1.7. I am going to try chunking the test set and see what happens. I will keep you posted. Thanks again.

I tried memory profiling a quick and dirty script to neutralize the example predictions on my laptop with Python 3.7. The laptop is running OSX Catalina and has 64 GB of RAM. My intuition about memory usage seems to have been roughly correct. With numpy.linalg.pinv the memory usage peaks at 19886 MB, with scipy.linalg.pinv2 it peaks at 19882.8 MB, and with scipy.linalg.pinv, I had to manually kill the script after its RSS grew beyond 60 GiB.

Since the difference between the amount of physical memory you have and the peak observed memory usage isn’t too big, I’d recommend checking whether you have swap enabled and adding an 8 GiB swap file if you don’t have one already. This is very easy on Linux with mkswap, swapon and swapoff. IIRC, you should also be able to do this on Windows, where swap is called a page file.

I think it also makes sense to do neutralization separately for each era as the feature correlations keep changing across eras. This way there would also be no memory issues.

2 Likes

Yes, it doesn’t really make sense to me to neutralize EXCEPT on an era-by-era basis, which shouldn’t be a problem computationally.

2 Likes

Thanks @jrb @voidcentury @wigglemuse. Era-by-era neutralization makes sense, and I can confirm that memory is no longer an issue when using a function greatly inspired by analysis_and_tips.ipynb:

from sklearn.preprocessing import MinMaxScaler

def full_neutralization(df, feature_names, pred_name="prediction_kazutsugi"):
    # neutralize the predictions era by era, then rescale them to [0, 1]
    df[pred_name] = df.groupby("era").apply(lambda x: normalize_and_neutralize(x, [pred_name], feature_names))
    scaled_preds = MinMaxScaler().fit_transform(df[[pred_name]])
    return scaled_preds
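
(normalize_and_neutralize comes from the analysis_and_tips notebook. For anyone who doesn’t have it handy, a rough Python sketch of it, reusing the neutralize() function from earlier in this thread, could look something like this:)

import scipy.stats

def normalize_series(s):
    # rank-transform to (0, 1), then map through the inverse normal CDF
    ranked = (s.rank(method="first") - 0.5) / len(s)
    return scipy.stats.norm.ppf(ranked)

def normalize_and_neutralize(df, columns, by, proportion=1.0):
    # gaussianize the prediction column(s), then neutralize them against the features
    df = df.copy()
    for col in columns:
        df[col] = normalize_series(df[col])
        df[col] = neutralize(df, target=col, by=by, proportion=proportion)
    return df[columns]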

Now I need to spend some more time upstream on my feature engineering / selection :thinking:

6 Likes

Here’s that neutralization code for R:

neutralize <- function(scores_v,exposures_m,proportion=1.0) {
  scores_v <- scores_v - (proportion * (exposures_m %*% (MASS::ginv(exposures_m) %*% scores_v)))
  return( scores_v/sd(scores_v) )
}

normalize_vector <- function(v) {
  qnorm( (rank(v)-0.5) / length(v) )
}

normalize_matrix <- function(m) {
  qnorm( (Rfast::colRanks(m)-0.5) / nrow(m) )
}

normalize_and_neutralize <- function(scores_v,exposures_m,proportion=1.0) {
  scores_v <- normalize_vector(scores_v)
  exposures_m <- normalize_matrix(exposures_m)
  return( neutralize(scores_v,exposures_m,proportion) )
}

You’ll need the “Rfast” package for the colRanks function (note that there are other packages with a same-named function). “MASS” should be included in any standard R installation. As I discussed with @jrb, I recommend you call “normalize_and_neutralize” rather than just “neutralize” – your results will be different (unless your data is already normalized in the same way) and probably better. The function expects a numeric vector for the scores and a matrix (not a data.frame) for the exposures. This has some slight differences from the Python version given in the tips notebook – namely, the ranking functions use the “average” method instead of the “first” method for breaking ties, which makes more sense to me for this application (“first” essentially introduces randomness, which might help but might hurt – both functions have a parameter you can set to “first” if you want, though). [Also, don’t have ties in your predictions.] And I don’t think the Python version actually normalizes the exposures, only the scores. That’s fine if the exposures matrix is the raw data or is otherwise standardized/normalized, but sometimes I am neutralizing with respect to other types of transformations of the data, so it’s just safer.

6 Likes

Here is a slightly different take on feature neutralization. Instead of finding a linear model of your predictions and subtracting a proportion of it off, we could instead find a linear model that, when subtracted off, reduces your feature exposures below a certain target. We can set a target and define a loss function such that, when it is minimized, all exposures will be less than or equal to the minimum of the current exposure and the maximum desired exposure. So if some features have an exposure of 0.05 and you set a max exposure of 0.10, the features with an exposure of 0.05 won’t necessarily decrease, as they would with the current neutralization code. This allows you to keep some of the smaller exposures that might be important, while reducing your largest risks. Test it out and let me know what you think! Be warned, it’s not especially fast…

import numpy as np
import pandas as pd
import scipy.stats

import torch
from torch.nn import Linear
from torch.nn import Sequential
import torch.nn.functional as F

def exposures(x, y):
    x = x - x.mean(dim=0)
    x = x / x.norm(dim=0)
    y = y - y.mean(dim=0)
    y = y / y.norm(dim=0)
    return torch.matmul(x.T, y)

def reduce_exposure(prediction, features, max_exp):
    # linear model of features that will be used to partially neutralize predictions
    lin = Linear(features.shape[1],  1, bias=False)
    lin.weight.data.fill_(0.)
    model = Sequential(lin)
    optimizer = torch.optim.Adamax(model.parameters(), lr=1e-4)
    feats = torch.tensor(np.float32(features)-.5)
    pred = torch.tensor(np.float32(prediction))
    start_exp = exposures(feats, pred[:,None])
    # set target exposure for each feature to be <= current exposure
    # if current exposure is less than max_exp, or <= max_exp if  
    # current exposure is > max_exp
    targ_exp = torch.clamp(start_exp, -max_exp, max_exp)

    for i in range(100000):
        optimizer.zero_grad()
        # calculate feature exposures of current linear neutralization
        exps = exposures(feats, pred[:,None]-model(feats))
        # loss is positive when any exposures exceed their target
        loss = (F.relu(F.relu(exps)-F.relu(targ_exp)) + F.relu(F.relu(-exps)-F.relu(-targ_exp))).sum()
        print(f'       loss: {loss:0.7f}', end='\r')
        if loss < 1e-7:
            neutralizer = [p.detach().numpy() for p in model.parameters()]
            neutralized_pred = pred[:,None]-model(feats)
            break
        loss.backward()
        optimizer.step()
    return neutralized_pred, neutralizer

def reduce_all_exposures(df, column, neutralizers=[],
                                     normalize=True,
                                     gaussianize=True,
                                     era_col="era",
                                     max_exp=0.1):
    unique_eras = df[era_col].unique()
    computed = []
    for u in unique_eras:
        print(u, '\r')
        df_era = df[df[era_col] == u]
        scores = df_era[column].values
        exposure_values = df_era[neutralizers].values
        
        if normalize:
            scores2 = []
            for x in scores.T:
                x = (scipy.stats.rankdata(x, method='ordinal') - .5) / len(x)
                if gaussianize:
                    x = scipy.stats.norm.ppf(x)
                scores2.append(x)
            scores = np.array(scores2)[0]

        scores, neut = reduce_exposure(scores, exposure_values, max_exp)

        scores /= scores.std()

        computed.append(scores.detach().numpy())

    return pd.DataFrame(np.concatenate(computed), columns=column, index=df.index)


TOURNAMENT_NAME = "kazutsugi"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"

## Get output of your model
# data[PREDICTION_NAME] = model.predict(data[feature_names])

# reduce feature exposure in each era to max_exp
data_rfe_10 = reduce_all_exposures(data,
                                   [PREDICTION_NAME],
                                   neutralizers=feature_names,
                                   era_col="era",
                                   max_exp=0.10)

# replace prediction with reduced feature exposure prediction and rescale to [0,1]
data[PREDICTION_NAME] = data_rfe_10[PREDICTION_NAME]
data[PREDICTION_NAME] -= data[PREDICTION_NAME].min()
data[PREDICTION_NAME] /= data[PREDICTION_NAME].max()

13 Likes

Might not be a bad idea to have that officially replace the current code in the analysis and tips notebook.

1 Like

Have both. The former is more general and way faster. And you can do quick total neutralization as a point of comparison.

3 Likes

Here’s a command-line feature neutralization script that I posted on RocketChat. Since it’s a standalone script, it should work regardless of whether you’re using Python or something else to build your models. It takes the tournament data and predictions files as inputs and outputs a neutralized CSV file.

Example usage:

# Fully neutralize predictions in example_predictions_target_kazutsugi.csv.xz wrt features in numerai_tournament_data.csv.xz
python neutralize.py numerai_tournament_data.csv.xz example_predictions_target_kazutsugi.csv.xz

Another example:

# Neutralize the top 10 highest exposed features by 50%
python neutralize.py -t 10 -p 0.5 numerai_tournament_data.csv.xz example_predictions_target_kazutsugi.csv.xz

6 Likes

Any way possible to keep my Colab from crashing while doing this?

You can restart your runtime and then load your saved predictions as float32. This will work under 10 GB of Colab memory.
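
Something like this should do it (the file name is just a placeholder for wherever you saved your predictions):

import numpy as np
import pandas as pd

TOURNAMENT_NAME = "kazutsugi"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"

# read the saved predictions back with the prediction column as float32
preds = pd.read_csv("saved_predictions.csv", index_col=0,
                    dtype={PREDICTION_NAME: np.float32})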

1 Like

It was pointed out to me that the R neutralization code I posted earlier in this thread (Model Diagnostics: Feature Exposure) doesn’t end up with values between [0,1]. That’s true – I left off that step at the end. So that’s normal, and you will need to do a minmax-type rescaling to get the values into the proper range for submission. (The actual values you end up with in that range aren’t important as long as they remain in the same order.)

1 Like

Here is my minmax scaler function for those interested:

minmax <- function(x){(x-min(x))/(max(x)-min(x))}

1 Like

I’m trying to fully understand the approach. I understand that we don’t want a bias term, but it is a bit unfamiliar to me to transform the data to a standard normal. Why do we need this?

With this line of code you first transform the input data (in our application, the features) to a uniform distribution. Then you apply the standard normal quantile function and get realizations from a standard normal. As you say, this is not always needed. What you want to have is just zero expectation (thus no bias), right? With the -0.5 you avoid the borders of the [0,1] interval, correct?

In normalize_vector you basically do the same.

With MASS::ginv(exposures_m) %*% scores_v you calculate the \beta of the linear model scores = features %*% \beta, and exposures_m %*% (MASS::ginv(exposures_m) %*% scores_v) then gives the fitted values of that model.

Then finally you calculate score_{neutral} = scores - proportion * \beta * features and rescale score_{neutral}.

Is this understanding correct?

1 Like