Edit: I made a mistake in the original code; it's fixed now.
Well, might as well start off with the obvious solution: PCA features. Since the outputs of PCA are by definition uncorrelated with one another, a set of linear models trained on disjoint subsets of the principal components should have low correlation to one another. Below is sample code to test this hypothesis:
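Before the full experiment, here is a minimal sanity check of the premise on toy data (random synthetic features, not the Numerai files): the PCA scores themselves are pairwise uncorrelated, so their sample correlation matrix is numerically the identity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# build deliberately correlated features by mixing independent noise
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))

# PCA scores: projections of X onto the principal components
scores = PCA(svd_solver='full').fit_transform(X)

corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.eye(corr.shape[0])
print(np.abs(off_diag).max())  # tiny, at floating-point noise level
```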
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from scipy import stats
def get_spearman_by_era(p, target, eras):
    # Spearman correlation between predictions and target, computed per era
    df = pd.DataFrame(columns=('p', 'target', 'eras'))
    df.loc[:, 'p'] = p
    df.loc[:, 'target'] = target
    df.loc[:, 'eras'] = eras
    era_names = df.eras.unique()
    output_dict = dict()
    for i in era_names:
        p0 = df.loc[df.eras == i, 'p']
        target0 = df.loc[df.eras == i, 'target']
        output_dict[i] = stats.spearmanr(p0, target0)
    return output_dict
def get_sharpe_ratio(spearmans):
    # mean over std of the per-era Spearman correlations
    output = np.zeros(len(spearmans))
    for j, i in enumerate(spearmans):
        output[j] = spearmans[i][0]
    return np.mean(output) / np.std(output)
print('loading files')
numerai_training_data = pd.read_csv("Input/numerai_training_data.csv")
numerai_tournament_data = pd.read_csv("Input/numerai_tournament_data.csv")
numerai_validation_data = numerai_tournament_data.loc[numerai_tournament_data.data_type == "validation"]
eras_train = numerai_training_data.loc[:,'era']
eras_valid = numerai_validation_data.loc[:,'era']
print('transforming data')
X_train = numerai_training_data.loc[:,'feature_intelligence1':'feature_wisdom46'].values
X_valid = numerai_validation_data.loc[:,'feature_intelligence1':'feature_wisdom46'].values
Y_train = numerai_training_data.loc[:,'target_kazutsugi'].values
Y_valid = numerai_validation_data.loc[:,'target_kazutsugi'].values
pca = PCA(svd_solver='full')
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_valid_pca = pca.transform(X_valid)
models = list()
p_train = list()
p_valid = list()
num_models = 5
for i in range(num_models):
    # model i is trained on principal components i, i+5, i+10, i+15, i+20
    X = np.c_[X_train_pca[:, i],
              X_train_pca[:, i + num_models],
              X_train_pca[:, i + num_models * 2],
              X_train_pca[:, i + num_models * 3],
              X_train_pca[:, i + num_models * 4]]
    models.append(LinearRegression().fit(X, Y_train))
    p_train.append(models[i].predict(X))
    p_valid.append(models[i].predict(np.c_[X_valid_pca[:, i],
                                           X_valid_pca[:, i + num_models],
                                           X_valid_pca[:, i + num_models * 2],
                                           X_valid_pca[:, i + num_models * 3],
                                           X_valid_pca[:, i + num_models * 4]]))
    spearman_by_era = get_spearman_by_era(p_train[i], Y_train, eras_train)
    spearman_by_era_valid = get_spearman_by_era(p_valid[i], Y_valid, eras_valid.reset_index(drop=True))
    print('Spearmans')
    print(stats.spearmanr(p_train[i], Y_train))
    print(stats.spearmanr(p_valid[i], Y_valid))
    print('Sharpe Ratios')
    print(get_sharpe_ratio(spearman_by_era))
    print(get_sharpe_ratio(spearman_by_era_valid))
    print('')
corr_train = np.corrcoef(p_train)
corr_valid = np.corrcoef(p_valid)
print('Correlation Coefficients')
print(corr_train)
print(corr_valid)
With the resulting output being:
loading files
transforming data
Spearmans
SpearmanrResult(correlation=0.01497619233887175, pvalue=2.6928745376728484e-26)
SpearmanrResult(correlation=0.013284837628533032, pvalue=1.4017169427249316e-05)
Sharpe Ratios
0.6599993421219399
0.48983802341438826
Spearmans
SpearmanrResult(correlation=0.010373654196207392, pvalue=2.0012757690323247e-13)
SpearmanrResult(correlation=0.012355041163440845, pvalue=5.355087706273797e-05)
Sharpe Ratios
0.5710633677918008
0.5557453873478579
Spearmans
SpearmanrResult(correlation=0.020850179835250258, pvalue=2.2366481230283833e-49)
SpearmanrResult(correlation=0.006082960200549018, pvalue=0.046722525083223526)
Sharpe Ratios
0.6798675967774218
0.20818493808291985
Spearmans
SpearmanrResult(correlation=0.012221942605479852, pvalue=4.795609428680984e-18)
SpearmanrResult(correlation=0.00434274497569329, pvalue=0.1556537105563265)
Sharpe Ratios
0.4294696314411901
0.2230339401171146
Spearmans
SpearmanrResult(correlation=0.01383767783987328, pvalue=1.094686955408993e-22)
SpearmanrResult(correlation=0.00010391446593068246, pvalue=0.9728976999576355)
Sharpe Ratios
0.5734665431860216
0.008617590297056693
Correlation Coefficients
[[ 1.00000000e+00 -6.57942828e-16 4.68513837e-16 -5.40879164e-16
-1.09616316e-15]
[-6.57942828e-16 1.00000000e+00 -4.43714100e-16 4.01355067e-16
9.56080427e-16]
[ 4.68513837e-16 -4.43714100e-16 1.00000000e+00 -9.55135609e-16
-4.48100911e-16]
[-5.40879164e-16 4.01355067e-16 -9.55135609e-16 1.00000000e+00
4.84548276e-16]
[-1.09616316e-15 9.56080427e-16 -4.48100911e-16 4.84548276e-16
1.00000000e+00]]
[[ 1.00000000e+00 -3.81042255e-02 3.84405057e-03 -3.95131357e-02
-5.40280742e-03]
[-3.81042255e-02 1.00000000e+00 -9.78007427e-04 1.06135805e-02
-1.03187024e-02]
[ 3.84405057e-03 -9.78007427e-04 1.00000000e+00 4.45218610e-02
-7.32469233e-03]
[-3.95131357e-02 1.06135805e-02 4.45218610e-02 1.00000000e+00
-4.22985841e-02]
[-5.40280742e-03 -1.03187024e-02 -7.32469233e-03 -4.22985841e-02
1.00000000e+00]]
The highest correlation coefficient between any of the five models is ~0.045, and the out-of-sample correlations are very similar to the in-sample ones. So that answers the question of whether predictions generated this way maintain their orthogonality. None of these models is particularly ‘good’, though, so I wouldn’t ever recommend staking on an individual one of them; still, this demonstrates that creating multiple positive-Sharpe, uncorrelated predictions is pretty trivial. There is probably a way to game MMC with very large numbers of these types of models, but I don’t know if there is any benefit to Numerai in having large numbers of models like this…
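For a bit of intuition on why sheer numbers of such models might matter: if N prediction streams are mutually uncorrelated and each has the same small positive per-era Sharpe, their simple average keeps the same mean but shrinks the volatility by roughly sqrt(N), so the combined Sharpe grows like sqrt(N). A toy simulation with purely synthetic per-era scores (not the models above) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_eras = 50, 10_000

# each row: one model's per-era score with a small positive mean and unit
# noise; rows are generated independently, hence mutually uncorrelated
scores = 0.05 + rng.normal(size=(n_models, n_eras))

sharpe_single = scores[0].mean() / scores[0].std()
combined = scores.mean(axis=0)
sharpe_combined = combined.mean() / combined.std()

# combined Sharpe comes out roughly sqrt(50) ≈ 7x the single-model Sharpe
print(sharpe_single, sharpe_combined)
```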