Learning Two Uncorrelated Models

Numerai is really just about finding some function (a model) that maps the features to a vector of predictions.

Imagine instead that Numerai posed the following problem: learn a function that maps the features to two prediction vectors such that both prediction vectors have equal performance but zero correlation with each other.

How would you learn two such predictions at the same time? And if your two models have zero correlation with each other over the training data, how can you be confident that they will have zero correlation out of sample?

Since, by construction, your two models are uncorrelated, they are likely to have very different feature exposures: e.g. one might heavily weight the intelligence group of features and the other might heavily weight the charisma group. The two models are therefore likely to combine very well. In fact, you might take both your models, sum them together, and use that as your model. Does this combination of the two uncorrelated predictions outperform your best model? It might have a lower mean, but does it have a higher Sharpe and better consistency and stationarity?
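For example, here is a quick sketch of that check, assuming preds_a, preds_b, targets and eras are numpy arrays you already have (hypothetical names):

import numpy as np
import pandas as pd
from scipy import stats

def era_sharpe(preds, targets, eras):
    # Spearman correlation per era, then mean over std across eras
    df = pd.DataFrame({'p': preds, 't': targets, 'era': eras})
    corrs = df.groupby('era').apply(lambda d: stats.spearmanr(d['p'], d['t'])[0])
    return corrs.mean() / corrs.std()

# confirm the two models really are (nearly) uncorrelated
print(np.corrcoef(preds_a, preds_b)[0, 1])

# sum the two prediction vectors (rank them first if they are on different scales)
blend = preds_a + preds_b

for name, p in [('A', preds_a), ('B', preds_b), ('A+B', blend)]:
    print(name, era_sharpe(p, targets, eras))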

Numerai’s new Meta Model Contribution (MMC) payouts, launching soon, open up the value of models which aren’t simply good at mapping from the features to predictions but are also uncorrelated with the meta model. Trying to build two models that are both good but perfectly uncorrelated out of sample is a worthwhile exercise for getting good at MMC. Perhaps one or both of the models you build in this way will have strong MMC.


This isn’t quite what you’re suggesting, but it’s tangentially related. In my attempt at an MMC-maximizing model, I have a penalty term in my loss function for correlation with the example predictions. I got a reasonably low correlation (0.4243) with the example predictions, which could perhaps have been pushed lower if it weren’t for early stopping.


That’s a great idea.

How did you implement the penalty on correlation with the example predictions in your loss function? Would you care to share that code? If you are training to maximize MMC in sample, does it end up staying uncorrelated with the example predictions out of sample when cross-validated?


Thanks! I wasn’t trying to directly maximize MMC, only to minimize correlation with the example predictions and to maximize rank correlation with the labels. My first attempt was to use JAX’s version of np.corrcoef, but it isn’t differentiable. As it turned out, the formula for Pearson’s correlation coefficient translates directly into simple differentiable code.
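For reference, the estimator in question is just the centred cross-moment divided by the product of the standard deviations:

r = mean((x - mean(x)) * (y - mean(y))) / (std(x) * std(y))

which is exactly what the code below computes.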

import jax.numpy as jnp

def correlation(x, y, axis=-1):
    # Differentiable Pearson correlation: centre both inputs, then divide the
    # mean of the cross-product by the product of the standard deviations.
    # keepdims=True makes the centring broadcast correctly for batched inputs.
    mx = jnp.mean(x, axis=axis, keepdims=True)
    my = jnp.mean(y, axis=axis, keepdims=True)
    xm, ym = x - mx, y - my
    r_num = jnp.mean(xm * ym, axis=axis)
    r_den = jnp.std(xm, axis=axis) * jnp.std(ym, axis=axis)
    # eps guards against division by zero when an input is constant
    return r_num / (r_den + jnp.finfo(float).eps)

This can be dropped directly into the loss function as a weighted term. Maximizing rank correlation builds on this, but needs a differentiable ranking function (which is a big ball of hair). The final loss function is a weighted sum of a regression loss (MSE), a ranking loss, and the aforementioned penalty for correlation with the example predictions.
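As a rough illustration (not the poster’s actual code), the combined loss might look something like this, where predict and soft_rank_corr are hypothetical stand-ins for the model and for a differentiable ranking loss, and the weights are made-up values:

import jax.numpy as jnp

def total_loss(params, x, y, example_preds, w_mse=1.0, w_rank=1.0, w_corr=0.1):
    p = predict(params, x)                             # hypothetical model function
    mse = jnp.mean((p - y) ** 2)                       # regression term
    rank_term = -soft_rank_corr(p, y)                  # maximize rank correlation (stub)
    corr_penalty = correlation(p, example_preds) ** 2  # squared, so correlation of either
                                                       # sign with the example preds is penalized
    return w_mse * mse + w_rank * rank_term + w_corr * corr_penalty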

I’ve also tried minimizing feature correlations directly with this, but it’s excruciatingly slow; it’s easier and far more efficient to enforce feature sparsity with L1 regularization.
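The L1 term itself is a one-liner by comparison; e.g., assuming a params pytree that keeps its input-weight matrix under a (hypothetical) 'w1' key:

# L1 penalty on the input weights pushes irrelevant features toward zero
l1_weight = 1e-4  # hypothetical strength; tune on validation data
loss = total_loss(params, x, y, example_preds) + l1_weight * jnp.sum(jnp.abs(params['w1']))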

And yes, once the loss weights are properly tuned, the decorrelation with the example predictions transfers fairly well out of sample.


Edit: I made a mistake in the code; it’s fixed now.

Well, might as well start off with the obvious solution: PCA features. Since the outputs of PCA are by definition uncorrelated with each other, a set of linear models trained on different principal components should have low correlation to one another. Below is sample code to test this hypothesis:


import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from scipy import stats


def get_spearman_by_era(p, target, eras):
    # Spearman rank correlation of predictions vs. the target, computed per era.
    # np.asarray strips any stray pandas index so the columns line up positionally.
    df = pd.DataFrame({'p': np.asarray(p),
                       'target': np.asarray(target),
                       'eras': np.asarray(eras)})
    output_dict = dict()
    for era in df.eras.unique():
        p0 = df.loc[df.eras == era, 'p']
        target0 = df.loc[df.eras == era, 'target']
        output_dict[era] = stats.spearmanr(p0, target0)
    return output_dict

def get_sharpe_ratio(spearmans):
    # Sharpe ratio of the per-era correlations: mean over standard deviation
    corrs = np.array([spearmans[era][0] for era in spearmans])
    return np.mean(corrs) / np.std(corrs)


print('loading files')
numerai_training_data = pd.read_csv("Input/numerai_training_data.csv")
numerai_tournament_data = pd.read_csv("Input/numerai_tournament_data.csv")
numerai_validation_data = numerai_tournament_data.loc[numerai_tournament_data.data_type == "validation"]

eras_train = numerai_training_data.loc[:,'era']
eras_valid = numerai_validation_data.loc[:,'era']

print('transforming data')
X_train = numerai_training_data.loc[:,'feature_intelligence1':'feature_wisdom46'].values
X_valid = numerai_validation_data.loc[:,'feature_intelligence1':'feature_wisdom46'].values


Y_train = numerai_training_data.loc[:,'target_kazutsugi'].values
Y_valid = numerai_validation_data.loc[:,'target_kazutsugi'].values


pca = PCA(svd_solver='full')
pca.fit(X_train)

X_train_pca = pca.transform(X_train)
X_valid_pca = pca.transform(X_valid)


models = list()
p_train = list()
p_valid = list()

num_models = 5

for i in range(num_models):

    # Each model gets its own disjoint set of five principal components:
    # columns i, i+5, i+10, i+15 and i+20 of the PCA-transformed data.
    cols = [i + num_models * j for j in range(5)]
    X = X_train_pca[:, cols]
    models.append(LinearRegression().fit(X, Y_train))
    p_train.append(models[i].predict(X))
    p_valid.append(models[i].predict(X_valid_pca[:, cols]))

    spearman_by_era = get_spearman_by_era(p_train[i], Y_train, eras_train)
    spearman_by_era_valid = get_spearman_by_era(p_valid[i], Y_valid, eras_valid.reset_index(drop=True))
    print('Spearmans')
    print(stats.spearmanr(p_train[i], Y_train))
    print(stats.spearmanr(p_valid[i], Y_valid))
    print('Sharpe Ratios')
    print(get_sharpe_ratio(spearman_by_era))
    print(get_sharpe_ratio(spearman_by_era_valid))
    print('')
    
corr_train = np.corrcoef(p_train)
corr_valid = np.corrcoef(p_valid)

print('Correlation Coefficients')
print(corr_train)
print(corr_valid)


With the resulting output being:

loading files
transforming data
Spearmans
SpearmanrResult(correlation=0.01497619233887175, pvalue=2.6928745376728484e-26)
SpearmanrResult(correlation=0.013284837628533032, pvalue=1.4017169427249316e-05)
Sharpe Ratios
0.6599993421219399
0.48983802341438826

Spearmans
SpearmanrResult(correlation=0.010373654196207392, pvalue=2.0012757690323247e-13)
SpearmanrResult(correlation=0.012355041163440845, pvalue=5.355087706273797e-05)
Sharpe Ratios
0.5710633677918008
0.5557453873478579

Spearmans
SpearmanrResult(correlation=0.020850179835250258, pvalue=2.2366481230283833e-49)
SpearmanrResult(correlation=0.006082960200549018, pvalue=0.046722525083223526)
Sharpe Ratios
0.6798675967774218
0.20818493808291985

Spearmans
SpearmanrResult(correlation=0.012221942605479852, pvalue=4.795609428680984e-18)
SpearmanrResult(correlation=0.00434274497569329, pvalue=0.1556537105563265)
Sharpe Ratios
0.4294696314411901
0.2230339401171146

Spearmans
SpearmanrResult(correlation=0.01383767783987328, pvalue=1.094686955408993e-22)
SpearmanrResult(correlation=0.00010391446593068246, pvalue=0.9728976999576355)
Sharpe Ratios
0.5734665431860216
0.008617590297056693

Correlation Coefficients
[[ 1.00000000e+00 -6.57942828e-16 4.68513837e-16 -5.40879164e-16
-1.09616316e-15]
[-6.57942828e-16 1.00000000e+00 -4.43714100e-16 4.01355067e-16
9.56080427e-16]
[ 4.68513837e-16 -4.43714100e-16 1.00000000e+00 -9.55135609e-16
-4.48100911e-16]
[-5.40879164e-16 4.01355067e-16 -9.55135609e-16 1.00000000e+00
4.84548276e-16]
[-1.09616316e-15 9.56080427e-16 -4.48100911e-16 4.84548276e-16
1.00000000e+00]]
[[ 1.00000000e+00 -3.81042255e-02 3.84405057e-03 -3.95131357e-02
-5.40280742e-03]
[-3.81042255e-02 1.00000000e+00 -9.78007427e-04 1.06135805e-02
-1.03187024e-02]
[ 3.84405057e-03 -9.78007427e-04 1.00000000e+00 4.45218610e-02
-7.32469233e-03]
[-3.95131357e-02 1.06135805e-02 4.45218610e-02 1.00000000e+00
-4.22985841e-02]
[-5.40280742e-03 -1.03187024e-02 -7.32469233e-03 -4.22985841e-02
1.00000000e+00]]

The highest correlation coefficient between any of the five models is ~0.045. The out-of-sample correlations are also very similar to the in-sample ones, so that answers the question of whether predictions generated this way maintain their orthogonality. None of these models is particularly ‘good’, though, so I wouldn’t recommend staking on any one of them individually; however, this does demonstrate that creating multiple positive-Sharpe, uncorrelated predictions is pretty trivial. There is probably a way to game MMC with very large numbers of these types of models, but I don’t know if there is any benefit to Numerai in having large numbers of models like this…


Sorry for the silly question, but what do you mean by example predictions? Do you mean the validation data in the tournament df?

Here: https://numer.ai/integration_test

The dataset zip includes the training data, the tournament data, and a CSV file of the predictions made by the example model, which is also found in the zip.