What exactly is neutralization?

The final stage of the data pipeline is data neutralization. However, it's a technique I've not come across, and googling the term “data neutralization” turns up nothing I can find.

Is this a term that Numerai has adopted or is it known as something else in general data science?

As for what it's actually doing, it seems to soften the correlations of the predictions with the features, but I don't have a great understanding of it. Can anyone offer a more thorough explanation of data neutralization, please? :slight_smile:


The idea is that you fit a linear model of your predictions on the features and subtract its output from your predictions, thereby removing that linear signal, or in other words, neutralizing the predictions against the features.
On YouTube there is excellent material from arbitrage, and there is also an in-depth post on this in the forum.
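
To make the "fit a linear model and subtract it" idea concrete, here is a minimal sketch on made-up data. The toy features, the toy predictions, the intercept column, and the helper name `exposures` are all assumptions for illustration and are not taken from the Numerai example scripts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 rows, 5 features, and predictions that lean
# heavily on the first feature plus a non-linear term and some noise.
n_rows, n_features = 500, 5
features = rng.normal(size=(n_rows, n_features))
predictions = (
    0.8 * features[:, 0]
    + np.tanh(features[:, 1] * features[:, 2])
    + 0.1 * rng.normal(size=n_rows)
)

# Fit a linear model of the predictions on the features (least squares).
# An intercept column is added here (an assumption of this sketch) so that
# the residual is exactly uncorrelated with every feature.
design = np.column_stack([np.ones(n_rows), features])
coefs, *_ = np.linalg.lstsq(design, predictions, rcond=None)
linear_part = design @ coefs

# Subtract the linear part: what remains is the neutralized prediction.
neutralized = predictions - linear_part


def exposures(preds, feats):
    """Pearson correlation of the predictions with each feature column."""
    return np.array(
        [np.corrcoef(preds, feats[:, j])[0, 1] for j in range(feats.shape[1])]
    )


print("exposure before:", np.round(exposures(predictions, features), 3))
print("exposure after: ", np.round(exposures(neutralized, features), 3))
```

The non-linear term survives the subtraction, which is the point: only the part of the predictions that a linear combination of the features can explain is removed.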

Thanks! I think I understand it in terms of subtracting a naive linear model :slight_smile:

In the analysis_and_tips notebook, however (example-scripts/analysis_and_tips.ipynb at master · numerai/example-scripts · GitHub), the neutralization function defined there doesn't seem to build a linear model of any sort. Instead it subtracts a proportion of a dot product involving the pseudo-inverse of the features (see code below), which is quite confusing.

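import numpy

def _neutralize(df, columns, by, proportion=1.0):
    scores = df[columns]  # predictions to neutralize
    exposures = df[by].values  # feature matrix
    # pinv(exposures).dot(scores) gives least-squares coefficients; subtract a proportion of the fitted values
    scores = scores - proportion * exposures.dot(numpy.linalg.pinv(exposures).dot(scores))
    return scores / scores.std(ddof=0)  # rescale to unit standard deviation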
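
For a feel of what the `proportion` argument does, here is a small sketch that restates the same computation in a self-contained form and runs it on made-up data. The function name `neutralize`, the helper `max_exposure`, the column names, and the toy data are assumptions for illustration. Exposure is measured here as the largest absolute dot product between a feature column and the neutralized predictions.

```python
import numpy as np
import pandas as pd


def neutralize(df, columns, by, proportion=1.0):
    """Same computation as the notebook function, restated with plain arrays."""
    scores = df[columns].to_numpy()
    exposures = df[by].to_numpy()
    # Least-squares fitted values of the scores given the features.
    projection = exposures @ (np.linalg.pinv(exposures) @ scores)
    residual = scores - proportion * projection
    # Rescale each column to unit standard deviation.
    scaled = residual / residual.std(axis=0, ddof=0)
    return pd.DataFrame(scaled, index=df.index, columns=columns)


def max_exposure(df, neutralized, by):
    """Largest absolute dot product between any feature and the predictions."""
    return float(np.abs(df[by].to_numpy().T @ neutralized.to_numpy()).max())


# Hypothetical toy data: three features and one noisy linear prediction.
rng = np.random.default_rng(1)
feature_cols = ["f0", "f1", "f2"]
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_cols)
df["prediction"] = (
    df[feature_cols].to_numpy() @ np.array([0.5, -0.3, 0.2])
    + rng.normal(size=300)
)

results = {}
for p in (0.0, 0.5, 1.0):
    out = neutralize(df, ["prediction"], feature_cols, proportion=p)
    results[p] = max_exposure(df, out, feature_cols)
    print(f"proportion={p}: max exposure = {results[p]:.6f}")
```

With `proportion=0.0` nothing is removed, with `proportion=1.0` the linear part is removed entirely, and values in between remove only a fraction of it.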

Here is the reason that we want to feature neutralize: Feature Exposure Clipping Tool, and working code to deploy locally | Numerai FN Special Part 3 - YouTube

And the notebook discussed in that video can be found here: twitch/FE_Clipping_Script.ipynb at master · jonrtaylor/twitch · GitHub


Someone correct me if I'm wrong, but I like to think of it as removing any one feature's influence on the predictions, such that the resulting predictions are evenly influenced by all dependent features.


I would say that a linear model/the inverse of the matrix is “influenced” by all columns/features. The computation generally involves all columns to get the result for one column. But for the rest I mainly agree. In my understanding you take out the linear effect of the features and want to get something that depends more on the aggregate of all features.


I was also confused by the way neutralization was done. I could see that what we are getting at is simply running a linear regression of predictions on features and then residualizing that out of the predictions. That was the idea in my mind, but as you mentioned, the code itself uses a pseudo-inverse rather than the usual regression formula built on the inverse of the predictors' cross-product matrix, (X'X)^-1 X'y.

Well, it turns out that the pseudo-inverse gives the solution to the least squares problem. I have a little explainer below. I assume an L2 norm; of course, if we change the norm then we change how we “neutralize”, which could be an interesting avenue:
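
A quick numerical check of that claim, as a sketch on made-up data rather than the explainer referred to above. For a feature matrix X with full column rank, pinv(X) y, the least-squares solver, and the normal-equations formula (X'X)^-1 X'y should all give the same coefficients, and the residual y - X beta should be orthogonal to every feature column. The variable names and the toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))  # hypothetical feature matrix
y = rng.normal(size=200)       # hypothetical predictions to neutralize

# Three routes to the coefficients that minimize ||y - X @ beta||_2.
beta_pinv = np.linalg.pinv(X) @ y                    # pseudo-inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solver
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations

print("pinv matches lstsq:           ", np.allclose(beta_pinv, beta_lstsq))
print("pinv matches normal equations:", np.allclose(beta_pinv, beta_normal))

# The neutralized predictions are the regression residual, which has
# zero dot product with every feature column.
residual = y - X @ beta_pinv
print("residual orthogonal to X:     ", np.allclose(X.T @ residual, 0.0))
```

One practical difference: the pseudo-inverse still returns an answer (the minimum-norm one) when features are collinear and X'X cannot be inverted, which may be one reason to prefer it in code.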




I was reading a chapter of Machine Learning for Finance by Jannes Klaas yesterday. I hadn't realized that with raw data a single feature's “signal” could overpower the signals from the other features, causing feature bias. The example it used was fraud cases in a bank transaction database where fraud was approximately 1% of the data. If you trained your model on the data “as is”, the model would be biased towards valid transactions. If you neutralize the features, then all of the signals have an equal vote towards the outcome and your model will be able to learn from all of them. At least that is how I understand it now.
