I want to share a new model that we have studied internally at Numerai. We think it is a big improvement over the current example predictions model, and the simple idea behind it could be helpful to many Numerai data scientists.
Our current example predictions have performed well over the last few months and currently place 27th in the tournament. The example predictions are built from a simple XGBoost model but with some important tweaks such as setting colsample_bytree=0.1 to reduce overfitting to specific features.
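For context, a model along these lines would look something like the following (a sketch only: colsample_bytree=0.1 is the tweak mentioned above, while the other parameters are illustrative assumptions, not the exact example predictions settings):

from xgboost import XGBRegressor

# colsample_bytree=0.1 makes each tree see only 10% of the features, which
# reduces overfitting to any specific feature; other parameters are assumptions
model = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=200,
                     colsample_bytree=0.1)
model.fit(train_features, train_targets["target_kazutsugi"])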
Let's look at the in sample performance of the example predictions as we train that model on the training data.
After 10 trees
era_scores.plot(kind="bar")
plt.show()
As you can see, even after 10 trees the model has learned to get positive correlation in most eras. However, there are still many eras with very weak or even negative correlations. (The x-axis here is eras in order - apologies for the image.)
After 200 trees
era_scores.plot(kind="bar")
plt.show()
After 200 trees, there are fewer negative eras and of course the mean correlation is a lot higher, but the performance of the model is still very inconsistent even in sample. Because of the large standard deviation between eras, the Sharpe of this model is only 2.28 in sample after 200 trees. The problem here is that the XGBoost model is really just trying to maximize its mean performance over the training data; it is not also trying to minimize the standard deviation of returns across eras or produce a model that is stationary through time (see post on Performance Stationarity).
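To be concrete, the Sharpe quoted here is just the mean of the per-era correlations divided by their standard deviation; era_scores below is assumed to be a pandas Series of per-era Spearman correlations like the one computed in the code at the end of this post:

import numpy as np

# era_scores: pandas Series of per-era Spearman correlations (see full code below)
in_sample_sharpe = np.mean(era_scores) / np.std(era_scores)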
So can we improve the XGBoost model we use in example predictions so that it cares about improving Sharpe over time, not just mean correlation? It is possible that there are fancy neural network architectures and cost functions that can do this directly in the learning. But there's another way: borrowing ideas from boosting, we can simply upweight the eras we want to improve, not just the training examples. We call this Era Boosting.
The Era Boosting algorithm
1. Build 10 trees on all eras in the training data.
2. Predict with your model over the training data and see which eras are in the worst half of performance vs the other eras.
3. Then build 10 new trees but only on the worst half of eras.
4. Predict with all your trees over the training data and see which eras are in the worst half of performance vs the other eras.
5. Then build 10 new trees but only on the worst half of eras… and repeat (a minimal sketch of this loop follows; the full code is at the end of the post).
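Conceptually, the whole loop is just the following (a minimal self-contained sketch, not the exact implementation; Michael's full version with logging and plots is at the end of the post):

import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

def era_boost_sketch(X, y, eras, steps=20, trees_per_step=10):
    # warm_start lets us keep adding trees to the same ensemble
    model = GradientBoostingRegressor(n_estimators=trees_per_step, warm_start=True)
    model.fit(X, y)  # the first 10 trees are built on all eras
    for _ in range(steps - 1):
        preds = pd.Series(model.predict(X), index=X.index)
        # per-era Spearman correlation between predictions and target
        scores = pd.Series({era: spearmanr(preds[eras == era], y[eras == era])[0]
                            for era in eras.unique()})
        worst = scores[scores <= scores.median()].index  # worst half of eras
        mask = eras.isin(worst)
        model.n_estimators += trees_per_step  # grow the ensemble, keeping old trees
        model.fit(X[mask], y[mask])  # fit the new trees only on the worst eras
    return model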
Era Boosting in action
The first 10 trees are the same as example predictions - they are built with all eras
But after 200 trees, where every 10 trees we told the model to build trees only for the eras where it was underperforming, the results change dramatically.
With the same number of trees as before, we now have no negative eras, and performance is consistent and similar across eras with a low standard deviation among era scores. The in sample Sharpe here is now 21.99.
By building trees on only the worst performing eras, we are in a sense asking the model to learn something that gives equal performance across all eras and minimizes the performance difference between a good era and a bad era. We are asking the model to learn something more stationary and consistent, and so it does.
Btw the era boosted models also have lower autocorrelation and higher Smart Sharpe in sample than regular models (see again Performance Stationarity).
Michael wrote up some simple code to do era boosting, which I have shared below. I think we'll integrate the idea into example scripts soon and perhaps use it for a new version of example predictions.
Open questions:
I didn't talk about out of sample performance in this post - a Sharpe of 22 is absolutely an overfit, so how can one use this idea without overfitting so quickly? Does it need a slower learning rate?
Does era boosting really perform better than example predictions if you did cross validation by holding out groups of eras?
We equal weight all the worst performing eras, but perhaps they should have weights that grow in some exponential way like AdaBoost does.
Can bagging on eras help, for example choosing a random sample of 67% of eras before selecting the worst half of eras? This would let the model see a more diverse distribution of eras. (A hypothetical sketch of this and the weighting idea above follows these questions.)
Do the era boosted models automatically feature neutralize themselves in some sense, or do they also take on high feature exposures? Is their feature exposure lower than example predictions?
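For the weighting and bagging questions above, the era-selection step inside the training loop could hypothetically be replaced with something like this (a sketch only; the exponential weight form, the temperature, and the 67% bag fraction are illustrative assumptions, and the weights would be passed to model.fit via sample_weight):

import numpy as np

def select_and_weight_eras(era_scores, eras, rng, bag_fraction=0.67, temperature=10.0):
    # era bagging: only consider a random 67% of eras before taking the worst half
    bagged = rng.choice(era_scores.index.values,
                        size=int(bag_fraction * len(era_scores)), replace=False)
    bagged_scores = era_scores[bagged]
    worst = bagged_scores[bagged_scores <= bagged_scores.median()].index
    # AdaBoost-like idea: exponentially larger weights for worse-scoring eras
    era_weights = np.exp(-temperature * bagged_scores[worst])
    era_weights = era_weights / era_weights.mean()  # normalize to mean 1
    mask = eras.isin(worst)
    sample_weight = eras[mask].map(era_weights).values  # each row gets its era's weight
    return mask, sample_weight

# in the loop below, instead of fitting on worst_df unweighted:
# mask, sw = select_and_weight_eras(era_scores, new_df["era"], np.random.default_rng(0))
# model.n_estimators += trees_per_step
# model.fit(new_df.loc[mask, features], new_df.loc[mask, "target"], sample_weight=sw)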
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

def ar1(x):
    # lag-1 autocorrelation of the era score series
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def autocorr_penalty(x):
    # inflates the standard deviation when era scores are autocorrelated
    n = len(x)
    p = ar1(x)
    return np.sqrt(1 + 2 * np.sum([((n - i) / n) * p**i for i in range(1, n)]))

def smart_sharpe(x):
    # Sharpe ratio adjusted for autocorrelation (see Performance Stationarity)
    return np.mean(x) / (np.std(x, ddof=1) * autocorr_penalty(x))

def era_boost_train(X, y, era_col, proportion=0.5, trees_per_step=10, num_iters=200):
    model = GradientBoostingRegressor(max_depth=5, learning_rate=0.01,
                                      max_features="sqrt", subsample=0.5,
                                      n_estimators=trees_per_step,
                                      warm_start=(num_iters > 1))
    features = X.columns
    model.fit(X, y)  # first batch of trees on all eras
    new_df = X.copy()
    new_df["target"] = y
    new_df["era"] = era_col
    for i in range(num_iters - 1):
        print(f"iteration {i}")
        # score each era
        print("predicting on train")
        preds = model.predict(X)
        new_df["pred"] = preds
        era_scores = pd.Series(index=new_df["era"].unique(), dtype=float)
        print("getting per era scores")
        for era in new_df["era"].unique():
            era_df = new_df[new_df["era"] == era]
            era_scores[era] = spearmanr(era_df["pred"], era_df["target"])[0]
        era_scores.sort_values(inplace=True)
        # keep only the worst half (proportion) of eras for the next batch of trees
        worst_eras = era_scores[era_scores <= era_scores.quantile(proportion)].index
        print(list(worst_eras))
        worst_df = new_df[new_df["era"].isin(worst_eras)]
        era_scores.sort_index(inplace=True)
        era_scores.plot(kind="bar")
        print("performance over time")
        plt.show()
        print("autocorrelation")
        print(ar1(era_scores))
        print("mean correlation")
        print(np.mean(era_scores))
        print("sharpe")
        print(np.mean(era_scores) / np.std(era_scores))
        print("smart sharpe")
        print(smart_sharpe(era_scores))
        # grow the ensemble; warm_start keeps the old trees in place
        model.n_estimators += trees_per_step
        print("fitting on worst eras")
        model.fit(worst_df[features], worst_df["target"])
    return model

boost_model = era_boost_train(train_features, train_targets["target_kazutsugi"],
                              era_col=train_targets["era"], proportion=0.5,
                              trees_per_step=10, num_iters=20)
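And to start answering the out of sample questions above, one could score the boosted model per era on held-out data in the same way (val_features and val_targets are hypothetical held-out eras in the same format as the training data):

import pandas as pd
from scipy.stats import spearmanr

val_preds = pd.Series(boost_model.predict(val_features), index=val_features.index)
val_scores = pd.Series({era: spearmanr(val_preds[val_targets["era"] == era],
                                       val_targets.loc[val_targets["era"] == era,
                                                       "target_kazutsugi"])[0]
                        for era in val_targets["era"].unique()})
print("out of sample mean correlation:", val_scores.mean())
print("out of sample sharpe:", val_scores.mean() / val_scores.std())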