Era Boosted Models

I want to share a new model that we have studied internally at Numerai. We think it is a big improvement over the current example predictions model and the simple idea behind it could be helpful to many Numerai data scientists. So we want to share it.

Our current example predictions have performed well over the last few months and currently place 27th in the tournament. The example predictions are built from a simple XGBoost model but with some important tweaks such as setting colsample_bytree=0.1 to reduce overfitting to specific features.
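To make that concrete, here is a minimal sketch of that kind of model. Only colsample_bytree=0.1 is confirmed above; the other parameters are placeholders taken from the code later in this thread, not the actual example predictions settings.

from xgboost import XGBRegressor

# colsample_bytree=0.1 means each tree sees only ~10% of the features,
# which keeps the model from leaning too heavily on any single feature
example_model = XGBRegressor(
    n_estimators=200,       # placeholder
    max_depth=5,            # placeholder, matches the code below
    learning_rate=0.01,     # placeholder, matches the code below
    colsample_bytree=0.1,   # the overfitting tweak mentioned above
    n_jobs=-1,
)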

Let’s look at the in sample performance of the example predictions as we train that model on the training data.

After 10 trees
era_scores.plot(kind="bar")
plt.show()

[figure: per-era correlations after 10 trees]

As you can see, even after 10 trees the model has learned to get positive correlation in most eras. However, there are still many eras with very weak or even negative correlations. (The x-axis here is eras in order - apologies for the image.)
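For reference, the era_scores series plotted here is the per-era Spearman correlation between the model's in-sample predictions and the target. A minimal sketch of how it can be computed, mirroring the scoring loop in the full code at the end of this post (model, X, y and eras are assumed to be the fitted model, training features, target and era column):

import pandas as pd
from scipy.stats import spearmanr

preds = model.predict(X)          # in-sample predictions on the training data
scored = X.copy()
scored["pred"] = preds
scored["target"] = y
scored["era"] = eras
era_scores = pd.Series({
    era: spearmanr(g["pred"], g["target"])[0]
    for era, g in scored.groupby("era")
})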

After 200 trees
era_scores.plot(kind="bar")
plt.show()

[figure: performance through time - per-era correlations after 200 trees]

After 200 trees, there are fewer negative eras and of course the mean correlation is a lot higher, but the performance of the model is still very inconsistent even in sample. Because of the large standard deviation between eras, the Sharpe of this model is only 2.28 in sample after 200 trees. The problem here is that the XGBoost model is really just trying to maximize its mean performance over the training data; it is not also trying to minimize the standard deviation of returns across eras or produce a stationary model through time (see the post on Performance Stationarity).
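Concretely, the in sample Sharpe quoted throughout this post is just the mean of the per-era correlations divided by their standard deviation:

sharpe = era_scores.mean() / era_scores.std()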

So can we improve the XGBoost model we use in example predictions so that it cares about improving Sharpe over time, not just mean correlation? It is possible that there are fancy neural network architectures and cost functions that can do this directly in the learning. But there is another way: borrowing ideas from boosting, we can simply upweight the eras we want to improve, not just the training examples. We call this Era Boosting.

The Era Boosting algorithm
Build 10 trees on all eras in the training data
Predict with your model over the training data and see which eras are in the worst half of performance vs the other eras
Then build 10 new trees but only on the worst half of eras
Predict with all your trees over the training data and see which eras are in the worst half of performance vs the other eras
Then build 10 new trees but only on the worst half of eras… and repeat (a minimal sketch of one step follows this list)
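A minimal sketch of a single era boosting step (the full implementation is at the end of this post); train_df, era_scores and model stand for the training frame, the per-era scores of the current model and the boosted model so far:

# keep only the eras in the worst half of per-era correlation...
worst_eras = era_scores[era_scores <= era_scores.quantile(0.5)].index
worst_df = train_df[train_df["era"].isin(worst_eras)]
# ...then grow the next 10 trees on those eras only, re-score, and repeat
model.n_estimators += 10
model.fit(worst_df[features], worst_df["target"])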

Era Boosting in action
The first 10 trees are the same as example predictions - they are built with all eras
[figure: per-era correlations after 10 trees of era boosting]

But after 200 trees, where every 10 trees we told the model to build trees only on the eras where it was underperforming, the results change dramatically.
[figure: per-era correlations after 200 trees of era boosting]

With the same number of trees as before, we now have no negative eras, and performance is consistent and similar across eras, with a low standard deviation of era scores. The in sample Sharpe here is now 21.99.

By building trees on only the worst performing eras, we are in a sense asking the model to learn something that gives equal performance across all eras and minimizes the performance difference between a good era and a bad era. We are asking the model to learn something more stationary and consistent and so it does.

Btw the era boosted models also have lower autocorrelation and higher Smart Sharpe in sample than regular models (see again Performance Stationarity).

Michael wrote up some simple code to do era boosting which I have shared below. I think we’ll integrate the idea into example scripts soon and perhaps use the idea for a new version of example predictions.

Open questions:
I didn’t talk about out of sample performance in this post - a Sharpe of 22 is absolutely an overfit, so how can one use this idea without overfitting so quickly? Does it need a slower learning rate?
Does era boosting really perform better than example predictions if you do cross validation by holding out groups of eras?
We equal weight all the worst performing eras, but perhaps they should have weights that grow in some exponential way, like AdaBoost does.
Can bagging on eras help, for example choosing a random sample of 67% of eras before selecting the worst half of eras (see the sketch after this list)? This would let the model see a more diverse distribution of eras.
Do the era boosted models automatically feature neutralize themselves in some sense, or do they also take on high feature exposures? Is their feature exposure lower than that of example predictions?
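On the era bagging question, a hypothetical (untested) tweak to the worst-era selection inside era_boost_train: subsample eras before picking the worst half, so each boosting step sees a different slice of history.

import numpy as np

all_eras = era_scores.index.to_numpy()
bag = np.random.choice(all_eras, size=int(0.67 * len(all_eras)), replace=False)
bag_scores = era_scores.loc[bag]
worst_eras = bag_scores[bag_scores <= bag_scores.quantile(proportion)].index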

import numpy as np

def ar1(x):
    # lag-1 autocorrelation of the era score series
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def autocorr_penalty(x):
    # factor that inflates the standard deviation when era scores are autocorrelated
    n = len(x)
    p = ar1(x)
    return np.sqrt(1 + 2*np.sum([((n - i)/n)*p**i for i in range(1,n)]))

def smart_sharpe(x):
    # Sharpe ratio adjusted for autocorrelation (see the Performance Stationarity post)
    return np.mean(x)/(np.std(x, ddof=1)*autocorr_penalty(x))

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

def era_boost_train(X, y, era_col, proportion=0.5, trees_per_step=10, num_iters=200):
    # warm_start keeps the already-fitted trees and adds new ones on each subsequent fit
    model = GradientBoostingRegressor(max_depth=5, learning_rate=0.01, max_features="sqrt", subsample=0.5, n_estimators=trees_per_step, warm_start=(num_iters>1))
    features = X.columns
    model.fit(X, y)
    new_df = X.copy()
    new_df["target"] = y
    new_df["era"] = era_col
    for i in range(num_iters-1):
        print(f"iteration {i}")
        # score each era
        print("predicting on train")
        preds = model.predict(X)
        new_df["pred"] = preds
        era_scores = pd.Series(index=new_df["era"].unique())
        print("getting per era scores")
        for era in new_df["era"].unique():
            era_df = new_df[new_df["era"] == era]
            era_scores[era] = spearmanr(era_df["pred"], era_df["target"])[0]
        era_scores.sort_values(inplace=True)
        worst_eras = era_scores[era_scores <= era_scores.quantile(proportion)].index
        print(list(worst_eras))
        worst_df = new_df[new_df["era"].isin(worst_eras)]
        era_scores.sort_index(inplace=True)
        era_scores.plot(kind="bar")
        print("performance over time")
        plt.show()
        print("autocorrelation")
        print(ar1(era_scores))
        print("mean correlation")
        print(np.mean(era_scores))
        print("sharpe")
        print(np.mean(era_scores)/np.std(era_scores))
        print("smart sharpe")
        print(smart_sharpe(era_scores))
        model.n_estimators += trees_per_step
        print("fitting on worst eras")
        model.fit(worst_df[features], worst_df["target"])
    return model

boost_model = era_boost_train(train_features, train_targets["target_kazutsugi"], era_col=train_targets["era"], proportion=0.5, trees_per_step=10, num_iters=20)

That is a pretty interesting topic. Just adding my 2 cents here.
If we refer to the previous discussions about performance stationarity, this seems like the perfect solution.
I have not looked at the model myself, but does it generalize at all? What are the results on the validation period?
I assume it is not too complex to achieve this result with trees. Just fitting each era independently and ensembling the trees, for example, should get you probably 90% of the way there, I imagine (do not quote me on this :slight_smile:); a rough sketch of that idea is below.
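A hypothetical sketch of that per-era ensemble idea (untested), assuming X, y and era_col are pandas objects as in the code above:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def per_era_ensemble(X, y, era_col, **tree_params):
    # fit one small model per era
    models = []
    for era in era_col.unique():
        mask = era_col == era
        m = GradientBoostingRegressor(**tree_params)
        m.fit(X[mask], y[mask])
        models.append(m)
    return models

def ensemble_predict(models, X):
    # average the per-era models' predictions
    return np.mean([m.predict(X) for m in models], axis=0)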
This brings me to the important point that I think Numerai really meant by performance stationarity: it is model stationarity. If the model is fitted and behaves differently across eras of the training set, I do not believe it has much chance of generalizing well, as we have a limited set of eras. This is the typical bias/variance trade-off.

Era boosting with XGBoost.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
# ar1 and smart_sharpe are the helper functions from the post above

def spearmanr(target, pred):
    # percentile-rank the predictions, then take the Pearson correlation with the target
    return np.corrcoef(
        target,
        pred.rank(pct=True, method="first")
    )[0, 1]

def era_boost_train(X, y, era_col, proportion=0.5, trees_per_step=10, num_iters=200):
    model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=trees_per_step, n_jobs=-1, colsample_bytree=0.1)
    features = X.columns
    model.fit(X, y)
    new_df = X.copy()
    new_df["target"] = y
    new_df["era"] = era_col
    for i in range(num_iters-1):
        print(f"iteration {i}")
        # score each era
        print("predicting on train")
        preds = model.predict(X)
        new_df["pred"] = preds
        era_scores = pd.Series(index=new_df["era"].unique())
        print("getting per era scores")
        for era in new_df["era"].unique():
            era_df = new_df[new_df["era"] == era]
            era_scores[era] = spearmanr(era_df["pred"], era_df["target"])
        era_scores.sort_values(inplace=True)
        worst_eras = era_scores[era_scores <= era_scores.quantile(proportion)].index
        print(list(worst_eras))
        worst_df = new_df[new_df["era"].isin(worst_eras)]
        era_scores.sort_index(inplace=True)
        era_scores.plot(kind="bar")
        print("performance over time")
        plt.show()
        print("autocorrelation")
        print(ar1(era_scores))
        print("mean correlation")
        print(np.mean(era_scores))
        print("sharpe")
        print(np.mean(era_scores)/np.std(era_scores))
        print("smart sharpe")
        print(smart_sharpe(era_scores))
        model.n_estimators += trees_per_step
        booster = model.get_booster()
        print("fitting on worst eras")
        model.fit(worst_df[features], worst_df["target"], xgb_model=booster)
    return model

boost_model = era_boost_train(train_features, train_targets["target_kazutsugi"], era_col=train_targets["era"], proportion=0.5, trees_per_step=10, num_iters=20)

XGBoost is still problematic when run this way. Let’s modify @mdo’s code a bit:

def era_boost_train(X, y, era_col, proportion=0.5, trees_per_step=10, num_iters=200, one_shot=False, tree_method='gpu_hist', test_model=None, note=None):
    print(f"\n#### Era boost train with proportion {proportion:0.3f} ####\n")
    if note is not None:
        print(note)
    if one_shot:
        trees_per_step = trees_per_step * num_iters
        num_iters=1

    if test_model is None:
        print(f"Train {num_iters} iterations")
        print(f"Train {trees_per_step} rounds per iteration")
    else:
        print("Testing model performance")
    features = X.columns
    new_df = X.copy()
    new_df["target"] = y
    new_df["era"] = era_col
    for i in range(num_iters):
        print(f"\nIteration {i+1}:\n")
        if test_model is None:
            if i==0:
                model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=trees_per_step, n_jobs=-1, colsample_bytree=0.1, tree_method=tree_method)
                model.fit(X, y)
            else:
                model.n_estimators += trees_per_step
                booster = model.get_booster()
                print("fitting on worst eras")
                model.fit(worst_df[features], worst_df["target"], xgb_model=booster)
        else:
            model = test_model
        # score each era
        print("predicting on train")
        preds = model.predict(X)
        new_df["pred"] = preds
        era_scores = pd.Series(index=new_df["era"].unique())
        print("getting per era scores")
        for era in new_df["era"].unique():
            era_df = new_df[new_df["era"] == era]
            era_scores[era] = spearmanr(era_df["pred"], era_df["target"])
        era_scores.sort_values(inplace=True)
        worst_eras = era_scores[era_scores <= era_scores.quantile(proportion)].index
        print(list(worst_eras))
        worst_df = new_df[new_df["era"].isin(worst_eras)]
        era_scores.sort_index(inplace=True)
        era_scores.plot(kind="bar")
        print("performance over time")
        plt.savefig(outdir+f"fig_{i}.png")  # save before show, otherwise the figure is cleared; outdir is assumed to be defined elsewhere
        plt.show()
        print("autocorrelation")
        print(ar1(era_scores))
        print("mean correlation")
        print(np.mean(era_scores))
        print("sharpe")
        print(np.mean(era_scores)/np.std(era_scores))
        print("smart sharpe")
        print(smart_sharpe(era_scores))
    return model

Now we run it, setting the proportion equal to one. The idea is that we should get the same results whether we train incrementally or all at once for that particular proportion. Let’s look at the in sample and out of sample results.

# idir is the input directory; the settings below match the run shown in the results
trees_per_step = 10
num_iters = 3
ifile = idir+"numerai_training_data.csv"
print(f"Read Numerai training data from {ifile}...")
df = pd.read_csv(ifile)
features = [c for c in df if c.startswith("feature")]
X = df[features]

note = "First create era boosted model."
boost_model_1 = era_boost_train(X, df["target_kazutsugi"], era_col=df["era"], proportion=1.0, trees_per_step=trees_per_step, num_iters=num_iters, note=note)
note = "Now create a regular model."
boost_model_2 = era_boost_train(X, df["target_kazutsugi"], era_col=df["era"], proportion=1.0, trees_per_step=trees_per_step, num_iters=num_iters, one_shot=True, note=note)

print("\nRead Numerai Tournament data...\n")
tfile = idir+"numerai_tournament_data.csv"
tdf = pd.read_csv(tfile).set_index("id")
tdf = tdf.loc[tdf['data_type'] == 'validation',].copy()
X = tdf[features]
note = "Look at out of sample for era-boosted model:"
era_boost_train(X, tdf["target_kazutsugi"], era_col=tdf["era"], proportion=1.0, trees_per_step=trees_per_step, num_iters=num_iters, one_shot=True, test_model=boost_model_1, note=note)
note = "Look at out of sample for regular model:"
era_boost_train(X, tdf["target_kazutsugi"], era_col=tdf["era"], proportion=1.0, trees_per_step=trees_per_step, num_iters=num_iters, one_shot=True, test_model=boost_model_2, note=note)

Now the results:

Read Numerai training data from ../input/ds_0208/numerai_training_data.csv...

#### Era boost train with proportion 1.000 ####

First create era boosted model.
Train 3 iterations
Train 10 rounds per iteration

Iteration 1:

predicting on train
getting per era scores
['era68', 'era103', 'era58', 'era91', 'era104', 'era60', 'era69', 'era41', 'era9', 'era42', 'era107', 'era110', 'era106', 'era19', 'era113', 'era119', 'era101', 'era73', 'era49', 'era66', 'era27', 'era85', 'era50', 'era74', 'era67', 'era89', 'era75', 'era57', 'era116', 'era70', 'era112', 'era21', 'era26', 'era46', 'era7', 'era18', 'era33', 'era40', 'era100', 'era59', 'era82', 'era32', 'era102', 'era79', 'era65', 'era31', 'era87', 'era118', 'era17', 'era84', 'era11', 'era62', 'era97', 'era80', 'era35', 'era28', 'era2', 'era55', 'era14', 'era3', 'era15', 'era114', 'era54', 'era34', 'era117', 'era43', 'era81', 'era88', 'era56', 'era44', 'era39', 'era111', 'era72', 'era24', 'era13', 'era37', 'era99', 'era45', 'era77', 'era98', 'era93', 'era25', 'era10', 'era71', 'era29', 'era61', 'era96', 'era51', 'era53', 'era78', 'era86', 'era38', 'era20', 'era63', 'era8', 'era5', 'era6', 'era95', 'era47', 'era94', 'era23', 'era52', 'era1', 'era4', 'era64', 'era30', 'era12', 'era108', 'era120', 'era92', 'era36', 'era48', 'era115', 'era90', 'era76', 'era22', 'era109', 'era83', 'era105', 'era16']
performance over time
autocorrelation
0.09222123754537882
mean correlation
0.06679633035407485
sharpe
2.1846286077155144
smart sharpe
1.9848699585634093

Iteration 2:

fitting on worst eras
predicting on train
getting per era scores
['era68', 'era103', 'era58', 'era91', 'era104', 'era60', 'era41', 'era69', 'era107', 'era66', 'era9', 'era42', 'era19', 'era106', 'era27', 'era49', 'era110', 'era85', 'era74', 'era101', 'era73', 'era113', 'era119', 'era67', 'era89', 'era75', 'era31', 'era7', 'era70', 'era40', 'era33', 'era50', 'era18', 'era79', 'era112', 'era100', 'era26', 'era57', 'era46', 'era116', 'era21', 'era59', 'era65', 'era82', 'era84', 'era55', 'era87', 'era32', 'era80', 'era118', 'era34', 'era2', 'era62', 'era114', 'era54', 'era17', 'era3', 'era35', 'era102', 'era111', 'era117', 'era11', 'era14', 'era15', 'era71', 'era25', 'era88', 'era81', 'era28', 'era37', 'era44', 'era97', 'era56', 'era98', 'era13', 'era24', 'era72', 'era51', 'era99', 'era43', 'era10', 'era45', 'era63', 'era78', 'era77', 'era39', 'era86', 'era29', 'era47', 'era93', 'era8', 'era61', 'era52', 'era6', 'era53', 'era20', 'era1', 'era38', 'era5', 'era94', 'era12', 'era96', 'era108', 'era23', 'era64', 'era90', 'era48', 'era120', 'era30', 'era95', 'era36', 'era115', 'era76', 'era92', 'era4', 'era22', 'era109', 'era16', 'era83', 'era105']
performance over time
autocorrelation
0.10810158200051456
mean correlation
0.07308286520688552
sharpe
2.2883751249083386
smart sharpe
2.046323092280946

Iteration 3:

fitting on worst eras
predicting on train
getting per era scores
['era68', 'era103', 'era58', 'era91', 'era104', 'era60', 'era41', 'era69', 'era107', 'era66', 'era9', 'era106', 'era49', 'era27', 'era19', 'era42', 'era113', 'era85', 'era110', 'era89', 'era119', 'era101', 'era73', 'era31', 'era7', 'era67', 'era18', 'era33', 'era40', 'era100', 'era70', 'era46', 'era112', 'era26', 'era75', 'era74', 'era57', 'era84', 'era50', 'era116', 'era65', 'era21', 'era59', 'era79', 'era32', 'era80', 'era82', 'era34', 'era87', 'era55', 'era118', 'era25', 'era111', 'era62', 'era2', 'era114', 'era54', 'era35', 'era117', 'era3', 'era17', 'era15', 'era71', 'era88', 'era102', 'era81', 'era14', 'era11', 'era28', 'era97', 'era56', 'era37', 'era98', 'era44', 'era24', 'era1', 'era51', 'era13', 'era93', 'era43', 'era99', 'era29', 'era78', 'era47', 'era52', 'era72', 'era45', 'era10', 'era53', 'era39', 'era86', 'era77', 'era63', 'era8', 'era12', 'era61', 'era20', 'era94', 'era5', 'era6', 'era38', 'era90', 'era96', 'era23', 'era64', 'era48', 'era108', 'era36', 'era30', 'era115', 'era120', 'era22', 'era76', 'era92', 'era95', 'era109', 'era16', 'era4', 'era83', 'era105']
performance over time
autocorrelation
0.10116833027108213
mean correlation
0.0782261415611412
sharpe
2.3853342448740174
smart sharpe
2.147903025024261

#### Era boost train with proportion 1.000 ####

Now create a regular model.
Train 1 iterations
Train 30 rounds per iteration

Iteration 1:

predicting on train
getting per era scores
['era68', 'era103', 'era91', 'era58', 'era41', 'era69', 'era104', 'era60', 'era66', 'era107', 'era49', 'era106', 'era27', 'era19', 'era9', 'era42', 'era113', 'era85', 'era119', 'era101', 'era110', 'era31', 'era73', 'era112', 'era67', 'era7', 'era26', 'era40', 'era84', 'era89', 'era50', 'era100', 'era46', 'era116', 'era65', 'era18', 'era33', 'era75', 'era32', 'era79', 'era21', 'era74', 'era70', 'era57', 'era59', 'era34', 'era25', 'era80', 'era87', 'era82', 'era118', 'era55', 'era111', 'era35', 'era71', 'era17', 'era62', 'era114', 'era117', 'era3', 'era14', 'era54', 'era2', 'era88', 'era15', 'era1', 'era37', 'era102', 'era28', 'era98', 'era97', 'era51', 'era24', 'era93', 'era81', 'era56', 'era47', 'era29', 'era78', 'era10', 'era44', 'era99', 'era11', 'era52', 'era12', 'era43', 'era20', 'era13', 'era77', 'era39', 'era45', 'era53', 'era94', 'era86', 'era72', 'era6', 'era61', 'era38', 'era63', 'era8', 'era96', 'era90', 'era5', 'era36', 'era115', 'era64', 'era23', 'era48', 'era22', 'era108', 'era30', 'era76', 'era120', 'era92', 'era95', 'era16', 'era109', 'era83', 'era4', 'era105']
performance over time
autocorrelation
0.08129248445845112
mean correlation
0.07639378884006609
sharpe
2.2663449766142603
smart sharpe
2.0817198544863738

Read Numerai Tournament data...


#### Era boost train with proportion 1.000 ####

Look at out of sample for era-boosted model:
Testing model performance

Iteration 1:

predicting on train
getting per era scores
['era127', 'era131', 'era121', 'era125', 'era126', 'era129', 'era130', 'era122', 'era124', 'era123', 'era128', 'era132']
performance over time
autocorrelation
-0.1550246653070757
mean correlation
0.03738069280208456
sharpe
1.4334015707235843
smart sharpe
1.583701400030438

#### Era boost train with proportion 1.000 ####

Look at out of sample for regular model:
Testing model performance

Iteration 1:

predicting on train
getting per era scores
['era127', 'era121', 'era126', 'era131', 'era125', 'era129', 'era122', 'era130', 'era123', 'era124', 'era128', 'era132']
performance over time
autocorrelation
-0.19234565338406373
mean correlation
0.03568585451261579
sharpe
1.4905205172696703
smart sharpe
1.7057722745747308

So you see: in sample, trained incrementally on exactly the same data, XGBoost generates extra Sharpe (no cheese). Out of sample, the regularly trained XGBoost performs better. One should be really careful using XGBoost this way.


Hmmm, interesting, but those differences aren’t big enough to convince me that this isn’t just a difference in how random seeds are used to select variables for the trees when training incrementally vs. all in one go. When restarting training 3 times vs. only once, I could see how things could easily come out a bit different depending on exactly how things are implemented. I actually find your test a bit reassuring that things aren’t completely broken when training like this with xgboost.
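One way to put a number on that seed hypothesis (a hypothetical sketch, not something from the thread): retrain the one-shot model with a few different random_state values, reusing the per-era scoring above, and compare the spread of in-sample Sharpes to the incremental-vs-one-shot gap.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def sharpe_for_seed(seed, X, y, era_col):
    # one-shot training with a fixed seed; spearmanr is the helper defined earlier in the thread
    model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=30,
                         n_jobs=-1, colsample_bytree=0.1, random_state=seed)
    model.fit(X, y)
    preds = pd.Series(model.predict(X), index=X.index)
    scores = pd.Series({era: spearmanr(y[era_col == era], preds[era_col == era])
                        for era in era_col.unique()})
    return scores.mean() / scores.std()

seed_sharpes = [sharpe_for_seed(s, X, df["target_kazutsugi"], df["era"]) for s in range(5)]
print(np.std(seed_sharpes))  # compare this spread to the in-sample gap observed above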

I am able to achieve similar results using neural networks with a custom loss function, which adds to the mean squared error the squared standard deviation of the absolute correlation coefficients between each feature and the predictions:

y_pred = […]
y_true = […]
corr_coefs = […]  # 310 correlation coefficients of the features with y_pred

mse = mean(square(y_pred - y_true))
correlation_penalty = square(std(abs(corr_coefs)))
loss = mse + correlation_penalty
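A minimal differentiable sketch of that loss, assuming PyTorch (the framework is not specified above) and assuming corr_coefs are Pearson correlations of each of the 310 features with the predictions on the current batch:

import torch

def mse_plus_corr_std(y_pred, y_true, X):
    # standard mean squared error term
    mse = torch.mean((y_pred - y_true) ** 2)
    # Pearson correlation of each feature column of X with the predictions
    xc = X - X.mean(dim=0, keepdim=True)
    pc = y_pred - y_pred.mean()
    corrs = (xc * pc.unsqueeze(1)).sum(dim=0) / (xc.norm(dim=0) * pc.norm() + 1e-12)
    # penalty: squared standard deviation of the absolute correlations
    penalty = torch.std(torch.abs(corrs)) ** 2
    return mse + penalty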

in sample results

numerai_score_mean_across_eras~0.397
sharpe~19.221


I made a few tests with CatBoost, but it seems that after a sufficient number of trees it ends up with a high Sharpe ratio on the training data anyway, even without modifying the training data between boosting steps.


After playing a bit with the validation 2 dataset and realizing how bad my models are (it’s already shown by the live data, but I was refusing to accept the reality), I’ve been trying different approaches: one of them focused on decreasing meta-model correlation, and another trying to improve correlation across eras while reducing feature exposure.

Base model

Train avg 0.09118565453972223, sharpe 2.61238059500593
Val1 avg 0.08252843111050458, sharpe 3.173499533551588
Val2 avg 0.008812162895404011, sharpe 0.28613703839300464

Feature exposure 0.08264085332303439

While the validation 2 average correlation across eras was still positive, the Sharpe ratio was a disaster. That’s why I’ve been trying to improve it, using a different approach from the one shown here but with quite similar results. Also, there is a group of features highly correlated with the predictions!
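A quick, hypothetical way to look at that exposure (df, features and preds stand for the data frame, the feature column names and the model’s predictions):

import numpy as np
import pandas as pd

feature_corrs = pd.Series({
    f: np.corrcoef(df[f].rank(), preds.rank())[0, 1]
    for f in features
}).sort_values()
print(feature_corrs.tail(10))      # the most positively exposed features
print(feature_corrs.abs().max())   # a simple worst-case feature exposure number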

Improved model

After trying to reduce the correlation between features and predictions I ended with a model that looks better.

Train avg 0.17071569939420486, sharpe 7.353940587005201
Val1 avg 0.16559936247951454, sharpe 8.009380660205698
Val2 avg 0.011188128720098198, sharpe 0.6254681270125086

Feature exposure 0.060100944750934554

The next step will be trying to include neutralization, but I’m torn between prediction neutralization and label neutralization; which do you think would be the better approach?
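For reference, a minimal sketch of linear prediction neutralization: subtract (a proportion of) the part of the predictions that a linear model on the features can explain. The proportion argument and the rescaling at the end are assumptions, not a specific recipe from this thread; label neutralization would apply the same operation to the target before training instead.

import numpy as np

def neutralize(preds, features, proportion=1.0):
    # preds: pandas Series of predictions; features: DataFrame of feature columns (same rows)
    exposures = np.hstack([features.values, np.ones((len(features), 1))])
    # linear projection of the predictions onto the features (plus a constant)
    correction = exposures @ np.linalg.lstsq(exposures, preds.values, rcond=None)[0]
    neutral = preds.values - proportion * correction
    return neutral / np.std(neutral)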
