2x Feature Neutralization Speed Up

by256 · November 8, 2021, 9:19am

Here’s a simple linear algebra trick to speed up feature neutralization (I got a 2x speed up on my machine but this will vary depending on your hardware).

In the feature neutralization function provided by Numerai, simply replace the line

scores -= proportion * (exposures @ (np.linalg.pinv(exposures) @ scores))

with

scores -= proportion * (exposures @ np.linalg.lstsq(exposures, scores)[0])

The two are equivalent: np.linalg.lstsq returns the minimum-norm least-squares solution to Ax = b, which is the same vector you get by taking the pseudo-inverse of A and matrix multiplying it by b. Forming the pseudo-inverse explicitly is the slower and less numerically stable of the two routes.
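For context, here is a minimal sketch of how the replaced line might sit inside a standalone function. The name neutralize_scores and its signature are hypothetical (this is not Numerai's actual function, which works on DataFrames), and rcond=None is passed so that NumPy uses its newer default singular-value cutoff.

```python
import numpy as np

def neutralize_scores(scores, exposures, proportion=1.0):
    """Hypothetical helper: remove `proportion` of the part of `scores`
    that is linearly explained by `exposures`."""
    scores = np.asarray(scores, dtype=np.float64)
    exposures = np.asarray(exposures, dtype=np.float64)
    # Least-squares coefficients of scores regressed on exposures
    coef = np.linalg.lstsq(exposures, scores, rcond=None)[0]
    # Subtract the fitted (exposed) component, scaled by proportion
    return scores - proportion * (exposures @ coef)

rng = np.random.default_rng(0)
exposures = rng.normal(size=(2000, 40))
scores = rng.normal(size=(2000, 1))

neutral = neutralize_scores(scores, exposures, proportion=1.0)
# With proportion=1.0 the result should be orthogonal to every exposure column
max_exposure = float(np.abs(exposures.T @ neutral).max())
print(max_exposure)
```

With proportion=1.0 the printed value should be tiny (floating-point noise), since the residual of a least-squares fit is orthogonal to the columns it was fitted on.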

This simple test in numpy shows that they’re equivalent:

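import numpy as np

M, N = 2000, 40

# Random tall matrix A (M x N) and right-hand side b (M x 1)
A = np.random.normal(size=(M, N))
b = np.random.normal(size=(M, 1))

# Solve the same least-squares problem both ways
x_pinv = np.linalg.pinv(A) @ b
# rcond=None selects NumPy's newer default cutoff and avoids a FutureWarning
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(x_pinv, x_lstsq))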

This should print True, and the mean-squared error between the two solutions is around 1e-33.
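If you want to check the speedup on your own hardware, a rough timing sketch along these lines should do. The helper names via_pinv and via_lstsq are just for illustration, and the numbers will depend on your machine and BLAS/LAPACK build, so treat the result as indicative only.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40))
b = rng.normal(size=(2000, 1))

def via_pinv():
    # Original approach: form the pseudo-inverse, then multiply
    return A @ (np.linalg.pinv(A) @ b)

def via_lstsq():
    # Proposed approach: solve the least-squares problem directly
    return A @ np.linalg.lstsq(A, b, rcond=None)[0]

# Best of 3 runs, each run making 20 calls
t_pinv = min(timeit.repeat(via_pinv, number=20, repeat=3))
t_lstsq = min(timeit.repeat(via_lstsq, number=20, repeat=3))

print(f"pinv:  {t_pinv:.4f} s for 20 calls")
print(f"lstsq: {t_lstsq:.4f} s for 20 calls")
```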

jefferythewind · November 9, 2021, 11:35am

This is great. I was just thinking about how I could speed up this neutralization code, and this looks perfect. Something I noticed in the code I pulled down from GitHub is that they convert the numbers to the float32 datatype for this step, probably to save time. So that is actually also going to be less accurate than what you’re suggesting. Without the conversion, the datatype should be float64. With your version it still runs pretty quickly.
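A quick way to see the size of the float32 effect is to solve the same problem with the inputs rounded to float32 and compare against the float64 solution. This is only a sketch on random data (not Numerai data), so the gap on real features may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40))
b = rng.normal(size=(2000, 1))

# Reference solution with float64 inputs
x64 = np.linalg.lstsq(A, b, rcond=None)[0]
# Same problem with the inputs rounded to float32 first
x32 = np.linalg.lstsq(A.astype(np.float32), b.astype(np.float32), rcond=None)[0]

# Largest absolute difference between the two coefficient vectors
max_diff = float(np.abs(x64 - x32).max())
print(max_diff)
```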

bigbertha · November 9, 2021, 8:55pm

This code generates the following warning. Not that I mind it; I just do not know where to pass the rcond parameter, or whether I want the current or the future behaviour …

FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  scores -= proportion * (exposures @ np.linalg.lstsq(exposures, scores)[0])

by256 · November 9, 2021, 9:16pm

@bigbertha

The rcond param is passed to np.linalg.lstsq, so the line becomes

scores -= proportion * (exposures @ (np.linalg.lstsq(exposures, scores, rcond=None)[0]))

The difference between the new and old behaviour is explained in the np.linalg.lstsq docs. I trust that the numpy overlords know what they’re doing, so I’m personally using the new default behaviour.
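As a sanity check, the two rcond settings can be compared directly. The sketch below uses a random, well-conditioned matrix, where I would expect both settings to report full rank and give the same solution; the cutoff should only matter when some singular values are tiny relative to the largest one.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40))
b = rng.normal(size=(2000, 1))

# New behaviour: cutoff is machine precision times max(M, N)
x_new, _, rank_new, _ = np.linalg.lstsq(A, b, rcond=None)
# Old behaviour: cutoff is machine precision
x_old, _, rank_old, _ = np.linalg.lstsq(A, b, rcond=-1)

print(rank_new, rank_old)
print(np.allclose(x_new, x_old))
```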

Hope this helps!

bluesapphire · December 8, 2021, 3:35pm

Thanks for catching this.

I’m an old timer, and back then we were forbidden to ever invert a matrix and then multiply by b. That’s numerically far less stable than a backsolve (and even within backsolve, there are several choices of factorization, some more applicable than others). There are too many ‘packages’ these days for anyone to care to learn all that.

bluesapphire · December 9, 2021, 1:38am

You may wanna try np.linalg.solve() - this is their backsolve, with a factorization such as LU, that avoids inverting the matrix.
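One caveat: np.linalg.solve() needs a square coefficient matrix, and the exposures matrix is tall (rows by features), so it cannot be passed in directly. One possible way to use it, sketched below as an assumption rather than as a recommendation, is to solve the normal equations (AᵀA)x = Aᵀb. That squares the condition number of A, so it can lose accuracy when features are nearly collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 40))
b = rng.normal(size=(2000, 1))

# Normal equations: (A^T A) x = A^T b, a square 40 x 40 system
x_solve = np.linalg.solve(A.T @ A, A.T @ b)
# Reference: direct least-squares solve on the tall matrix
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(x_solve, x_lstsq))
```

On well-conditioned random data like this the two should agree closely; with strongly correlated features, lstsq is the safer choice.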
