Here’s that neutralization code for R:
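
    # Remove `proportion` of the part of the scores explained linearly by the exposures,
    # then rescale to unit standard deviation.
    neutralize <- function(scores_v, exposures_m, proportion = 1.0) {
      scores_v <- scores_v - (proportion * (exposures_m %*% (MASS::ginv(exposures_m) %*% scores_v)))
      return( scores_v / sd(scores_v) )
    }

    # Rank-transform a vector to standard normal quantiles (ties get the average rank).
    normalize_vector <- function(v) {
      qnorm( (rank(v) - 0.5) / length(v) )
    }

    # Same transform, applied to each column of a matrix.
    normalize_matrix <- function(m) {
      qnorm( (Rfast::colRanks(m) - 0.5) / nrow(m) )
    }

    # Normalize both inputs first, then neutralize.
    normalize_and_neutralize <- function(scores_v, exposures_m, proportion = 1.0) {
      scores_v <- normalize_vector(scores_v)
      exposures_m <- normalize_matrix(exposures_m)
      return( neutralize(scores_v, exposures_m, proportion) )
    }
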
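A quick sanity check on the core step may be useful. With proportion = 1, `neutralize` subtracts the least-squares projection of the scores onto the exposure columns, so the result should be orthogonal to every column; with a smaller proportion some exposure is left in. The sketch below uses made-up random data (the seed, sizes, and variable names are assumptions for illustration only) and needs only MASS.

```r
# neutralize() repeated here so the snippet runs on its own.
neutralize <- function(scores_v, exposures_m, proportion = 1.0) {
  scores_v <- scores_v - proportion * (exposures_m %*% (MASS::ginv(exposures_m) %*% scores_v))
  scores_v / sd(scores_v)
}

set.seed(1)                                          # arbitrary seed, illustration only
exposures <- matrix(rnorm(200 * 5), nrow = 200)      # hypothetical 200 x 5 exposure matrix
scores <- as.numeric(exposures %*% rnorm(5)) + rnorm(200)  # hypothetical scores, correlated with exposures

full <- neutralize(scores, exposures)                # proportion = 1: full neutralization
half <- neutralize(scores, exposures, 0.5)           # proportion = 0.5: partial neutralization

# Largest absolute dot product between any exposure column and the result.
max_full <- max(abs(crossprod(exposures, full)))     # zero up to floating-point error
max_half <- max(abs(crossprod(exposures, half)))     # clearly non-zero
```

In both cases the output is rescaled to a standard deviation of 1, so only the relative ordering and spacing of the values carry information.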
You’ll need the “Rfast” package for the colRanks function (note that other packages have a function with the same name, so the code qualifies it as Rfast::colRanks). “MASS” should be included in any standard R installation. As I discussed with @jrb, I recommend you call “normalize_and_neutralize” rather than just “neutralize”: your results will be different (unless your data is already normalized in the same way) and probably better. The function expects a numeric vector for the scores and a matrix (not a data.frame) for the exposures, so convert a data.frame with as.matrix() first.

This differs slightly from the Python version given in the tips notebook. The ranking functions use the “average” method instead of the “first” method for breaking ties, which makes more sense to me for this application (“first” essentially introduces randomness, which might help but might hurt; both ranking functions have a parameter you can set to “first” if you want it, though). [Also, don’t have ties in your predictions.] And I don’t think the Python version actually normalizes the exposures, only the scores. That is fine if the exposures matrix is the raw data or is otherwise standardized/normalized, but sometimes I neutralize with respect to other transformations of the data, and normalizing the exposures as well is just safer.
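To make the tie-breaking point concrete, here is a small base-R sketch on a made-up vector with one tie. It only illustrates how the two settings of `ties.method` in `rank` feed into the same transform used by `normalize_vector`; the toy values are an assumption for illustration.

```r
v <- c(0.2, 0.5, 0.5, 0.9)                       # toy predictions; the two 0.5s are tied

r_avg   <- rank(v, ties.method = "average")      # 1.0 2.5 2.5 4.0 (tied values share a rank)
r_first <- rank(v, ties.method = "first")        # 1 2 3 4 (ties broken by position)

# Same rank-to-normal transform as normalize_vector().
z_avg   <- qnorm((r_avg   - 0.5) / length(v))
z_first <- qnorm((r_first - 0.5) / length(v))

# With "average" the tied predictions stay equal after the transform;
# with "first" they are pushed apart purely because of their order in the vector.
tied_equal_avg   <- z_avg[2] == z_avg[3]
tied_equal_first <- z_first[2] == z_first[3]
```

For the matrix version, I believe the corresponding argument of Rfast::colRanks is `method`, but check the package documentation before relying on that.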