Model Diagnostics: Feature Exposure

jrb · September 8, 2020, 8:33am

I tried memory profiling a quick and dirty script to neutralize the example predictions on my laptop with Python 3.7. The laptop is running OSX Catalina and has 64 GB of RAM. My intuition about memory usage seems to have been roughly correct. With numpy.linalg.pinv the memory usage peaks at 19886 MB, with scipy.linalg.pinv2 it peaks at 19882.8 MB, and with scipy.linalg.pinv, I had to manually kill the script after its RSS grew beyond 60 GiB.

Since the difference between the amount of physical memory you have and the peak observed memory usage isn’t too big, I’d recommend checking if you have swap enabled and adding a 8 GiB swap file, if you don’t have it already. This is very easy if you’re on Linux with mkswap, swapon and swapoff. IIRC, you should also be able to do this on Windows, where swap is called a page file.

voidcentury · September 10, 2020, 5:13am

I think it also makes sense to do neutralization separately for each era as the feature correlations keep changing across eras. This way there would also be no memory issues.

wigglemuse · September 10, 2020, 1:15pm

Yes, it doesn’t really make sense to me to neutralize EXCEPT on an era-by-era basis, which shouldn’t be a problem computationally.

jeremy_berros · September 17, 2020, 6:22am

Thanks @jrb @voidcentury @wigglemuse. Era by era neutralization makes sense and I confirm that memory is not an issue anymore using a function greatly inspired by analysis_and_tips.ipynb

def full_neutralization(df, feature_names, pred_name="prediction_kazutsugi"):
    df[pred_name] = df.groupby("era").apply(lambda x: normalize_and_neutralize(x, [pred_name], feature_names))
    scaled_preds = MinMaxScaler().fit_transform(df[[pred_name]])
    return scaled_preds

Now I need to spend some more time upstream on my feature engineering / selection

wigglemuse · September 20, 2020, 11:46pm

Here’s that neutralization code for R:

neutralize <- function(scores_v,exposures_m,proportion=1.0) {
  scores_v <- scores_v - (proportion * (exposures_m %*% (MASS::ginv(exposures_m) %*% scores_v)))
  return( scores_v/sd(scores_v) )
}

normalize_vector <- function(v) {
  qnorm( (rank(v)-0.5) / length(v) )
}

normalize_matrix <- function(m) {
  qnorm( (Rfast::colRanks(m)-0.5) / nrow(m) )
}

normalize_and_neutralize <- function(scores_v,exposures_m,proportion=1.0) {
  scores_v <- normalize_vector(scores_v)
  exposures_m <- normalize_matrix(exposures_m)
  return( neutralize(scores_v,exposures_m,proportion) )
}

You’ll need “Rfast” package for colRanks function (note that there are other packages with same-named function). “MASS” should be included in any standard R installation. As I discussed with @jrb, I recommend you call “normalize_and_neutralize” rather than just “neutralize” – your results will be different (unless your data is already normalized in the same way) and probably better. The function is expecting a numeric vector for scores and a matrix (not a data.frame) for the exposures. This has some slight differences from the python version given in the tips notebook – namely the ranking functions are using the “average” method instead of the “first” method for breaking ties which makes more sense to me for this application (as “first” essentially introduces randomness which might help, but might hurt – both functions have a parameter to can set to “first” if you want though). [Also, don’t have ties in your predictions.] And I don’t think the python version actually normalizes the exposures, only the scores. Which is fine if the exposures matrix is the raw data or is otherwise standardized/normalized, but sometimes I am neutralizing with respect to other types of transformations of the data and it is just safer.

mdo · September 21, 2020, 2:53am

Here is a slightly different take on feature neutralization. Instead of finding a linear model of your predictions and subtracting a proportion of it off, we could instead find a linear model that when subtracted off reduces your feature exposure below a certain target. We could set a target and define a loss function such that when minimized all exposures will be less than or equal to the minimum of current exposure and the maximum desired exposure. So if some features have an exposure of 0.05, and you set a max exposure of 0.10, the features with the exposure of 0.05 won’t necessarily decrease as they would in the current neutralization code. This allows you to keep some of the smaller exposures that might be important, while reducing your largest risks. Test it out and let me know what you think! Be warned, it’s not especially fast…

import torch
from torch.nn import Linear
from torch.nn import Sequential
from torch.functional import F

def exposures(x, y):
    x = x - x.mean(dim=0)
    x = x / x.norm(dim=0)
    y = y - y.mean(dim=0)
    y = y / y.norm(dim=0)
    return torch.matmul(x.T, y)

def reduce_exposure(prediction, features, max_exp):
    # linear model of features that will be used to partially neutralize predictions
    lin = Linear(features.shape[1],  1, bias=False)
    lin.weight.data.fill_(0.)
    model = Sequential(lin)
    optimizer = torch.optim.Adamax(model.parameters(), lr=1e-4)
    feats = torch.tensor(np.float32(features)-.5)
    pred = torch.tensor(np.float32(prediction))
    start_exp = exposures(feats, pred[:,None])
    # set target exposure for each feature to be <= current exposure
    # if current exposure is less than max_exp, or <= max_exp if  
    # current exposure is > max_exp
    targ_exp = torch.clamp(start_exp, -max_exp, max_exp)

    for i in range(100000):
        optimizer.zero_grad()
        # calculate feature exposures of current linear neutralization
        exps = exposures(feats, pred[:,None]-model(feats))
        # loss is positive when any exposures exceed their target
        loss = (F.relu(F.relu(exps)-F.relu(targ_exp)) + F.relu(F.relu(-exps)-F.relu(-targ_exp))).sum()
        print(f'       loss: {loss:0.7f}', end='\r')
        if loss < 1e-7:
            neutralizer = [p.detach().numpy() for p in model.parameters()]
            neutralized_pred = pred[:,None]-model(feats)
            break
        loss.backward()
        optimizer.step()
    return neutralized_pred, neutralizer

def reduce_all_exposures(df, column, neutralizers=[],
                                     normalize=True,
                                     gaussianize=True,
                                     era_col="era",
                                     max_exp=0.1):
    unique_eras = df[era_col].unique()
    computed = []
    for u in unique_eras:
        print(u, '\r')
        df_era = df[df[era_col] == u]
        scores = df_era[column].values
        exposure_values = df_era[neutralizers].values
        
        if normalize:
            scores2 = []
            for x in scores.T:
                x = (scipy.stats.rankdata(x, method='ordinal') - .5) / len(x)
                if gaussianize:
                    x = scipy.stats.norm.ppf(x)
                scores2.append(x)
            scores = np.array(scores2)[0]

        scores, neut = reduce_exposure(scores, exposure_values, max_exp)

        scores /= scores.std()

        computed.append(scores.detach().numpy())

    return pd.DataFrame(np.concatenate(computed), columns=column, index=df.index)


TOURNAMENT_NAME = "kazutsugi"
PREDICTION_NAME = f"prediction_{TOURNAMENT_NAME}"

## Get output of your model
# data[PREDICTION_NAME] = model.predict(data[feature_names])

# reduce feature exposure in each era to max_exp
data_rfe_10 = reduce_all_exposures(data,
                                   [PREDICTION_NAME],
                                   neutralizers=feature_names,
                                   era_col="era",
                                   max_exp=0.10)

# replace prediction with reduced feature exposure prediction and rescale to [0,1]
data[PREDICTION_NAME] = data_rfe_10[PREDICTION_NAME]
data[PREDICTION_NAME] -= data[PREDICTION_NAME].min()
data[PREDICTION_NAME] /= data[PREDICTION_NAME].max()

objectscience · September 21, 2020, 7:40am

Might not be a bad idea to have that officially replace the current code in the analysis and tips notebook.

wigglemuse · September 21, 2020, 12:19pm

Have both. The former is more general and way faster. And you can do quick total neutralization as a point of comparison.

jrb · October 10, 2020, 2:03pm

Here’s a command line feature neutralization script that I posted on RocketChat. Since it’s a standalone script, it should work regardless of whether you’re using Python to build your models or something else. It takes the tournament data and predictions files as the inputs and outputs a neutralized csv file.

Example usage:

# Fully neutralize predictions in example_predictions_target_kazutsugi.csv.xz wrt features in numerai_tournament_data.csv.xz
python neutralize.py numerai_tournament_data.csv.xz example_predictions_target_kazutsugi.csv.xz

Another example:

# Neutralize the top 10 highest exposed features by 50%
python neutralize.py -t 10 -p 0.5 numerai_tournament_data.csv.xz example_predictions_target_kazutsugi.csv.xz

oxioxi · December 13, 2020, 11:51am

Any way possible to keep my Colab from crashing while doing this?

surajp · December 15, 2020, 5:10pm

You can restart your runtime and then load saved predictions as float32. This will work under 10GB of colab memory.

wigglemuse · January 8, 2021, 3:57pm

It was pointed out to me that the R neutralization code I posted earlier in this thread (Model Diagnostics: Feature Exposure) doesn’t end up with values between [0,1]. That’s true – I left off that step at the end. So that’s normal, and you will need to do a minmax type rescaling to get the values into the proper ranger for submission. (The actual values you end up with in that range aren’t important as long as they remain in the same order.)

senadorancap · January 9, 2021, 2:42pm

Here is my minmax scaler function for those interested:

minmax <- function(x){(x-min(x))/(max(x)-min(x))}

kreator · January 22, 2021, 8:37pm

I try to fully understand the approach. I understand that we don’t want a bias term, but it is a bit unfamiliar for me to transform the data to a standard normal. Why do we need this?

With this line of code you first transform the input data (in our appliction the features) to a uniform distribution. Then you apply the standard normal quantile and get realizations from a standard normal. As you say, this is not always needed. What you want to have is just zero expectation (thus no bias), right? With the -0,5 you avoid the borders of the [0,1] interval, correct?

In normalize_vector you basically do the same.

With exposures_m %*% (MASS::ginv(exposures_m) you calculate the \beta of the linear model: scores = \beta * features.

Then finally you calculate score_{neutral} = scores - proportion * \beta* features and rescale score_{neutral}.

Is this understanding correct?

wigglemuse · January 22, 2021, 9:17pm

First of all, let me just say for the R code in particular that I was just translating the python code given by the team, so at first I was exactly replicating in R what they did in python so I could compare results of each version side-by-side to make sure I got it right. (This should have been trivial for a function of a few lines, but I don’t actually code in python so I had to do it one detail at a time. When I did it I didn’t quite understand the function myself because of mathematical deficiencies of my own – I didn’t even understand that the pseudo-inverse calculation was making an OLS model.) Anyway, I was just trying to get an exact translation at first, but then in the end I didn’t exactly replicate it as I noted – my version applies normalization on the “exposures” (features) as well as the scores (predictions) whereas theirs doesn’t, and I left out the min-max scaling at the end to get it back into the [0,1] range. (I do that part later in my own workflow.) So the reason I used qnorm and subtracted 0.5 from the ranks (to avoid 0 & 1 as you noted) is simply because that matches what they did in python and I’m not sure that is an important detail for this. (They use that same type of rank normalization in the scoring function so were probably just borrowing their own code.) If we just ranked and rescaled to [0,1] I bet results would be pretty much the same (but not identical). I probably tried that, can’t remember.

Also with the lack of the bias term – if you add one I don’t think it hurts, but results will be basically the same. (I definitely tested that.) And I don’t see why it is necessary to divide the result by the standard deviation (since that doesn’t change any rankings), but again that’s in the python version so there it is.

kreator · January 22, 2021, 9:49pm

Thank you for your transparent comments and the efforts with the code!!

senadorancap · January 24, 2021, 10:35am

That’s the matricial format of OLS algorithm without an intercept and how do i know that? Well let me say that is about the advantages of not taking a nap during the econometric classes

I think i have one for Ridge:

  ridge_neutralize <- function(scores_v,exposures_m,proportion=1.0,ridge=1.0) {
  scores_v <- scores_v - (proportion * (exposures_m %*%
                                        (MASS::ginv((t(exposures_m) %*% exposures_m) +
                                        ridge*length(scores_v)*(diag(ncol(exposures_m)))) %*%
                                        (t(exposures_m) %*% scores_v))))
  return( scores_v/sd(scores_v) )
}

lollocodes · February 8, 2021, 9:50am

guys im having issues with the ram , how can i run the code ?

ieortools · February 15, 2021, 3:31pm

@lollocodes depends on you system and language of choice. I use R and I’ve found that you need a decent memory size to perform this analysis. My machine has 16GB RAM. I would imagine this is simiar to using Python.

With using R I’ve also found using h2o.ai works rather well. h2o.ai allows to run in parallel and multiple clusters.

acassoni · March 19, 2021, 4:08pm

Hello! I am new to the competition…

Can someone explain to me the difference between applying feature neutralization to the features on the target, to get a set of features that contain as much original information as possible but decorrelate with the target VS neutralizing predictions by features?

Thanks.