Alignment between the performance of tournament participants and hedge fund profitability is a key element in the construction of Numerai. If a model is ranked at the top of the Numerai leaderboard it should be because it is helping to improve the profitability of the hedge fund the most. Currently, users are evaluated only at the signal level: how well their signal correlates with the target (CORR) and their contribution to the Meta Model signal (MMC). However, Numerai’s portfolio is created by running our custom optimizer on the Meta Model signal. The optimizer enforces constraints and penalties on the portfolio that affect which aspects of the Meta Model signal are reflected in the final portfolio. This can create divergence between what appears to be a good model at the signal level and a model that is truly helping the fund create better portfolios.
For example, the optimizer penalizes feature exposure and thus large feature exposures in the Meta Model signal will not be reflected in the final portfolio. A user with a high feature exposure model may get great correlation with the target (for a while), but their signal will have limited influence on the portfolio since the feature exposure of the portfolio is constrained. Such a user could earn large payouts without ever contributing much information to the portfolio. This is obviously undesirable.
To better align our evaluations of users and the hedge fund performance we are introducing a new metric we call “True Contribution”. The goal of this metric is to estimate how much a user’s signal improves or detracts from the returns of Numerai’s portfolio. By using this metric for payouts, user incentives and hedge fund performance are in perfect alignment. With True Contribution as the payout metric, a user’s stake would increase if their model increased portfolio returns and decrease (burn) if the model reduced returns.
In our first pass at creating True Contribution, we calculated the stake weighted Meta Model while leaving each user out in turn, used the production optimizer to generate the corresponding portfolios, calculated the returns, and then compared them to the returns of the full stake weighted Meta Model portfolio in order to calculate “True Contribution” (a rough sketch of this procedure follows the list below). There are a few problems with this formulation:
- A user’s contribution is then heavily dependent upon their stake, and identical signals with different stakes get different scores
- Because users with 0 stake would always have 0 contribution, there is no way to calculate the metric for unstaked users
- Users with small stakes would always have ~0 contribution
- Because the production optimizer starts from our current portfolio and enforces turnover constraints, the TC scores are heavily dependent on our past portfolios, which users have no knowledge of or control over
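To make that concrete, here is a minimal sketch of the leave-one-user-out calculation. The helper optimize_portfolio and the exact stake weighting are hypothetical stand-ins for the production pipeline, not the actual implementation:

import numpy as np

# Hypothetical sketch of the leave-one-user-out formulation described above.
# `optimize_portfolio` stands in for the production optimizer and
# `predictions` is an (n_stocks, n_users) matrix of user signals.
def leave_one_out_tc(predictions, stakes, stock_returns, optimize_portfolio):
    full_signal = predictions @ stakes / stakes.sum()  # stake weighted Meta Model
    full_returns = optimize_portfolio(full_signal) @ stock_returns
    tc = np.zeros(len(stakes))
    for i in range(len(stakes)):
        held_out = stakes.copy()
        held_out[i] = 0  # remove user i's stake
        signal = predictions @ held_out / held_out.sum()
        # contribution = portfolio returns lost when user i's stake is removed
        tc[i] = full_returns - optimize_portfolio(signal) @ stock_returns
    return tc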
Our latest version of TC fixes all these issues while retaining the realism of portfolio construction and returns. To do this, we first realized that the leave-one-user-out method is really just approximating a gradient calculation. What we really want is a quantification of how changing a user’s stake changes the portfolio returns, which is the gradient of portfolio returns with respect to users’ stakes. A true gradient calculation also has some nice properties: 1) it can be computed for all users simultaneously from a single portfolio optimization rather than a separate optimization for each held-out user, 2) it assigns the same value to identical signals with different stakes, and 3) it assigns proper values to 0 stakes. The first property is important for our AWS bills while the second and third properties are important for fairness in the tournament.
But performing a true gradient calculation would require taking a derivative through our portfolio optimizer, which is impossible, right? Actually, no! This seemingly magical feat can be accomplished quite simply using cvxpylayers. This remarkable package, based on the award-winning 2019 research paper by Agrawal et al., allows you to include a cvxpy-defined convex optimization as a layer in a PyTorch model. Below is our fully differentiable PyTorch module for calculating a portfolio from user predictions and stakes, using a simple Linear layer and our cvxpy-based optimizer.
import cvxpy as cp
import torch.nn as nn
from cvxpylayers.torch import CvxpyLayer

class SWMModel(nn.Module):
    # Simple end-to-end portfolio model
    def __init__(self, num_stakes, context, optimizer):
        super().__init__()
        self.optimizer = optimizer
        self.context = context
        # set initial portfolio to 0
        self.context.current_portfolio[:] = 0
        # stake weighted Meta Model as a Linear layer
        self.lin1 = nn.Linear(num_stakes, 1, bias=False)

    def forward(self, user_predictions):
        # calculate stake weighted Meta Model signal
        x1 = self.lin1(user_predictions)
        xin = cp.Parameter(x1.shape)
        # get cvxpy problem from optimizer
        self.context.alpha_scores = xin
        self.optimizer._build_optimization_routine(self.context.current_portfolio, self.context, True)
        problem = self.optimizer._optimization_routine
        assert problem.is_dpp()
        # insert cvxpy problem into a CvxpyLayer
        cvxpylayer = CvxpyLayer(problem, parameters=[xin], variables=problem.variables())
        # solve the problem using output of swmm as input to cvxpylayer
        solution = cvxpylayer(x1, solver_args={"max_iters": 1500})
        # net portfolio (e.g. long minus short variables from the optimizer)
        out = solution[0] - solution[1]
        return out, x1
We can use this module to calculate portfolio returns and the gradient of the portfolio returns with respect to stakes as follows:
swmm = SWMModel(len(stakes), context=context, optimizer=n1_optimizer)
# set weights of linear layer to be user stakes
swmm.lin1.weight.data = stakes.T
swmm.zero_grad()
# get optimized portfolio and swmm signal
swmm_port, swmm_signal = swmm(user_preds)
# calculate portfolio returns
portfolio_returns = swmm_port.T @ stock_returns
# calculate gradient of returns with respect to stakes
portfolio_returns.backward()
# extract gradients from Linear stake weighting layer
stake_grads = swmm.lin1.weight.grad.numpy().copy()
To regularize this gradient, reduce the effect of stake size, and reduce dependencies between user predictions, we can perform dropout on the user stakes (i.e. randomly zero out 50% of the stakes) before calculating the stake weighted Meta Model and the gradients. To calculate our final TC estimate, we perform 100 rounds of dropout and then average the gradients across the 100 rounds:
import numpy as np
import torch.nn.functional as F

stake_grads = []
for i in range(100):
    print(f'bag {i}', end='\r')
    # set stakes with dropout
    swmm.lin1.weight.data = F.dropout(stakes.T, .5)
    swmm.zero_grad()
    # get optimized portfolio and unoptimized signal
    swmm_port, swmm_signal = swmm(user_preds)
    # calculate portfolio returns and the gradient of returns wrt stakes
    portfolio_returns = swmm_port.T @ stock_returns
    portfolio_returns.backward()
    stake_grads.append(swmm.lin1.weight.grad.numpy().copy())
# final TC estimate: average the gradients across the 100 dropout rounds
tc_estimate = np.mean(np.stack(stake_grads), axis=0)
This process gives very stable estimates that are 99.5% correlated across repeated trials with different dropout masks. The regularization also doesn’t produce results that are vastly different from the unregularized gradient; the two are in fact about 90% correlated. While perhaps not absolutely necessary, we feel this regularization helps with the fairness and robustness of the metric, especially given that in reality models are dropping in and out of Numerai’s Meta Model all the time.
Taking a proper gradient solves the first three problems with our initial formulation. To address the fourth problem, making True Contribution independent of our current portfolio holdings, we can create a modified version of our optimizer that removes the turnover constraint and gives the optimizer a full trading budget to find the optimal portfolio given the Meta Model signal. This generates a hypothetical but realistic portfolio which satisfies all the remaining constraints of the optimizer. While this modified optimizer won’t produce the real portfolio we actually trade, the portfolio it does produce is a realistic reflection of how the Meta Model signal interacts with the portfolio optimizer and its various constraints and penalties.
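To illustrate the idea, here is a minimal, hypothetical cvxpy sketch of a long-short optimization with no turnover constraint. The specific constraints and limits are illustrative placeholders, not the actual settings of our production optimizer:

import cvxpy as cp
import numpy as np

n_stocks = 500                           # illustrative universe size
alpha = cp.Parameter(n_stocks)           # Meta Model signal
w = cp.Variable(n_stocks)                # portfolio weights

constraints = [
    cp.sum(w) == 0,                      # dollar neutral
    cp.norm(w, 1) <= 1.0,                # gross exposure capped at 100%
    cp.abs(w) <= 0.05,                   # per-name position limit
    # note: no turnover constraint relative to a previous portfolio
]
problem = cp.Problem(cp.Maximize(alpha @ w), constraints)

alpha.value = np.random.randn(n_stocks)  # stand-in signal
problem.solve()
optimal_weights = w.value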
Hopefully you find this formulation of TC as compelling as we do. In any case, you are probably wondering which existing metrics best correspond to TC. To get a better sense of the relationship, we can fit a model to predict TC scores from other metrics. A good choice for building flexible and interpretable models is the Explainable Boosting Machine (EBM). The EBM fits a generalized additive model (GAM) with 2-way interactions. The EBM is tree based like standard Gradient Boosting Machines (e.g. XGBoost, LightGBM) but is restricted to fit only GAMs. In the GAM formulation each variable (and interaction) gets its own learned function, and these are all additively combined. To interpret the model you can compare importance scores and visualize the learned functions for each variable. A good proxy metric for TC would have both a high importance score and a monotonic relationship to TC. For this analysis I fit a model predicting TC from various metrics for rounds 272-300. Obviously this can only show us what TC has historically been related to and is no guarantee of what will happen in the future as users change their models. But caveats aside, let’s see what we find:
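(For reference, fitting an EBM of this sort with the interpret package looks roughly like the hypothetical sketch below; the metric columns and synthetic data are placeholders, not our actual analysis code.)

import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

# synthetic stand-in for a table of per-model, per-round metrics and TC scores
rng = np.random.default_rng(0)
feature_cols = ["corr", "mmc", "fncv3", "fncv3_tb200", "exposure_dissimilarity", "max_feature_exposure"]
metrics_df = pd.DataFrame(rng.normal(size=(1000, len(feature_cols))), columns=feature_cols)
tc = rng.normal(size=1000)  # placeholder target; in practice, the TC scores

# GAM with 2-way interactions; importances and shape functions can then be inspected
ebm = ExplainableBoostingRegressor(feature_names=feature_cols, interactions=10)
ebm.fit(metrics_df, tc)
global_explanation = ebm.explain_global()  # visualize with interpret's show() in a notebook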
We see that far and away the best proxy is FNCv3, that is, a prediction’s correlation with the target after the prediction has been neutralized to the 420 features in the “medium” feature set (which will be formally announced later this week!). This measures how much alpha your signal has that isn’t linearly explained by the features. FNCv3 also shows a nice monotonic relationship to TC. (The bit of jaggedness in the functions is just overfitting and can be removed by tuning the EBM hyperparameters; the general trend is clear.)
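For readers who want the mechanics, here is a rough sketch of a feature-neutral correlation calculation under common Numerai scoring conventions; the rank-and-gaussianize step and the use of Pearson correlation are assumptions, not the official implementation:

import numpy as np
from scipy.stats import norm, rankdata

def feature_neutral_corr(pred, target, features):
    # rank and gaussianize the prediction (assumed preprocessing)
    p = norm.ppf((rankdata(pred) - 0.5) / len(pred))
    # remove the part of the prediction linearly explained by the features
    beta, *_ = np.linalg.lstsq(features, p, rcond=None)
    neutral = p - features @ beta
    # correlate the neutralized prediction with the target
    return np.corrcoef(neutral, target)[0, 1]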
The next best proxy is the interaction between FNCv3 and “Exposure Dissimilarity”. Exposure Dissimilarity is a simple metric that compares a model’s pattern of feature exposures to that of the example predictions. The basic idea is that a signal containing information not already in the example predictions is likely to have a very different pattern of feature exposures. To calculate Exposure Dissimilarity:
- Calculate the correlation of a user’s prediction and the example prediction with each of the features to form two vectors U and E.
- Take the dot product of U and E divided by the dot product of E with E. This measures how similar the pattern of exposures are and is normalized to be 1 if U is identical to E.
- Subtract from 1 to form a dissimilarity metric where 0 means the same exposure pattern as example predictions, positive values indicate differing patterns of exposure and negative values indicate similar patterns but even higher exposures. Note that models with 0 feature exposure will have a dissimilarity value of 1.
Exposure Dissimilarity = 1 − (U•E)/(E•E)
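In code, the calculation might look something like this rough sketch (the helper name and the use of Pearson correlation for the per-feature exposures are assumptions):

import numpy as np

def exposure_dissimilarity(user_pred, example_pred, features):
    # features: (n_stocks, n_features); predictions: length n_stocks
    # correlation of each prediction with each feature forms the exposure vectors U and E
    U = np.array([np.corrcoef(user_pred, features[:, j])[0, 1] for j in range(features.shape[1])])
    E = np.array([np.corrcoef(example_pred, features[:, j])[0, 1] for j in range(features.shape[1])])
    # 1 - U·E / E·E: 0 means the same exposure pattern as the example predictions
    return 1 - (U @ E) / (E @ E)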
By itself, Exposure Dissimilarity doesn’t explain TC, but its combination with FNCv3 in a multiplicative interaction is the next best proxy for TC. (This interaction was included explicitly because in preliminary analysis the EBM kept finding what looked like a strong multiplicative interaction between these variables.) The interaction term also makes intuitive sense: TC rewards signals that are both unique and that contain feature-independent alpha. It too bears a strong monotonic relationship to TC.
The next most important metric is the venerable MMC, which also shows a strong monotonic relationship to TC.
This is followed by the correlation of the top/bottom 200 elements of the feature neutralized prediction with the target, i.e. FNCv3 TB200. This metric also shows a strong monotonic relationship to TC, over and above the FNCv3 relationship. Indeed, if this metric carried no additional useful information, its function would not appear this cleanly monotonic, as we will see with CORR. This shows that good performance in the tails is also important for explaining TC.
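For clarity, a TB200-style score can be sketched as follows (a rough illustration, not the official scoring code):

import numpy as np

def tb200_corr(neutral_pred, target, n=200):
    # keep only the n most bearish and n most bullish names of the prediction
    order = np.argsort(neutral_pred)
    tails = np.concatenate([order[:n], order[-n:]])
    # correlate the prediction with the target on those tails only
    return np.corrcoef(neutral_pred[tails], target[tails])[0, 1]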
The next most important metric is Maximum Feature Exposure. While this metric doesn’t strongly influence TC, as you can see from the comparatively small dynamic range of the function on the Y-axis, the interesting thing in this plot is that TC seems most associated with small but nonzero maximum feature exposures. The optimal range for max feature exposure seems to be roughly [0.05, 0.30].
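As a reference, maximum feature exposure can be sketched as the largest absolute correlation between a prediction and any single feature (an assumption about the exact definition):

import numpy as np

def max_feature_exposure(pred, features):
    # absolute correlation of the prediction with each feature; take the largest
    exposures = [abs(np.corrcoef(pred, features[:, j])[0, 1]) for j in range(features.shape[1])]
    return max(exposures)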
The final metric we will discuss is CORR. As you can see from the plot below, the relationship between CORR and TC has a small dynamic range and is notably non-monotonic. I want to emphasize that if CORR were the only input to the EBM, we would see an apparent monotonic relationship to TC. On average, higher CORR is associated with higher TC, but when the other metrics are included they explain TC more cleanly and leave CORR with little additional variance to account for.
As you can see from the above, TC seems to capture the properties we have long recommended user models possess: predictive power that isn’t too dependent on single features, predictive power in the tails, and uniqueness. To help everyone out, I made a follow-up post demonstrating methods for directly optimizing metrics like FNC and TB200. Judging by the models doing the best at TC, some of you have been listening closely and have figured a lot of things out already.
To maximize backward compatibility while maximizing the impact of TC, starting April 9th users will be able to stake on (0x or 1x CORR) + (0x or 1x or 2x TC). Staking on MMC will be automatically discontinued on that date. So if you are currently staking on 1x CORR and 2x MMC, your stake will be 1x CORR only starting April 9th unless you also elect to stake on 1x TC or 2x TC. Numerai will not automatically convert any MMC stakes to TC stakes. TC staking will start as opt-in only. There will be no changes to the payout factor for the time being.