Summary (because long post):
TC is planned to be rolled out in two weeks. Although we have been given a reason why TC should be a better metric and been shown that it correlates with other metrics within the same round, there is one problem: a positive TC score in one round has practically no correlation with your TC score in future rounds. A positive MMC or FNC score in a round correlates more strongly with future positive TC scores. TC seems extremely noisy; if the goal of Numer.ai is to improve performance on TC, it makes more sense to pay out based on MMC/FNC.
Intro
Although Numer.ai has performed well as a hedge fund over the last years, they always seem to strive to improve. We have seen this with, for example, Signals and with the increased dataset size plus the new targets. Numer.ai now has another innovation planned: True Contribution (TC). The idea behind it is to pay users more closely to what they contribute, by estimating how much increasing a model's weight in the meta model would have increased profits, taking into account how their optimizer would have reacted to this new meta model.
The main argument in favor is that it aligns incentives between the users and the hedge fund's performance: it should be fairer, since a user earns only if their predictions had a positive impact on performance, and it should increase the performance of the hedge fund. A lot of theoretical backing has been given, both explaining the math behind it and, more high-level, the Alien Stock Market Intelligence Medium post by Richard. From the start TC has been pushed pretty heavily as make or break for Numer.ai, even when it was in a non-functioning state.
In general it feels to me like this had been decided before results could be gathered, and even now that they can be, I am a bit disappointed that the data is still missing. We have been shown some of the correlations between other metrics and TC, but this is not the same as a backtest of what would have happened if you pay out on TC. We have, to my knowledge, still not received any backtest of the results of TC. I fear that the TC decision was made because it looks cool and impressive and is thus very marketable, and that less importance has been placed on its effect on the future results of the hedge fund and on the participants.
Backtest
So what is the goal of TC? Presumably to increase the future performance of the hedge fund by increasing the stakes of people with high TC. A model that contributed positively to the hedge fund's performance is also more likely to do so in the future, right? Well, how can we test this? If a model has positive TC in a round that just ended, Numer.ai wants to give it a higher payout, with the idea that it performs better next round. But is this so?
Using Numer.ai's GraphQL API I have downloaded all round results from round 285 up to 304, i.e. from a few rounds after the last big change (the increased dataset size) up to the last resolved round.
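For those who want to reproduce this, a minimal sketch of the download step is below. The endpoint is Numer.ai's public GraphQL API, but the query shape and field names are assumptions on my part; check them against the schema in the GraphiQL explorer before running.

```python
import requests

API_URL = "https://api-tournament.numer.ai/graphql"

# NOTE: the query shape and field names below are illustrative
# assumptions; verify them against the live schema.
QUERY = """
query ($roundNumber: Int!) {
  v2RoundDetails(roundNumber: $roundNumber) {
    roundNumber
    models {
      username
      corr
      mmc
      fnc
      tc
      corrWithMetamodel
    }
  }
}
"""

def fetch_round(round_number: int) -> dict:
    """POST one GraphQL query and return the payload for that round."""
    resp = requests.post(
        API_URL,
        json={"query": QUERY, "variables": {"roundNumber": round_number}},
    )
    resp.raise_for_status()
    return resp.json()["data"]["v2RoundDetails"]

# All resolved rounds used in this analysis: 285 up to and including 304.
rounds = {r: fetch_round(r) for r in range(285, 305)}
```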
Then I attempt to figure out whether users kept submitting the same model, by assuming that somebody switched models if their correlation with the meta model changes by more than 5% from one round to the next.
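A sketch of that filter, assuming the round results have been collected into a pandas DataFrame with one row per (model, round); all column names here are my own:

```python
import pandas as pd

# df: one row per (model, round), with columns like
# model, round, corr_with_mm, corr, mmc, fnc, tc (names assumed).
df = df.sort_values(["model", "round"])

# Flag a presumed model switch when the correlation with the meta model
# jumps by more than 5% between consecutive rounds of the same model.
switched = df.groupby("model")["corr_with_mm"].diff().abs() > 0.05

# Keep only rows where the model appears unchanged from the prior round.
stable = df[~switched].copy()
```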
Using this data I can do some simple checks of whether TC correlates with the other metrics. In general, my analysis filters everybody above a certain quantile in some metric and looks at how they perform on TC. For example, if a model performs well on CORR, is it likely to perform well on TC as well? You can see in the bottom-left plot underneath that having a positive CORR score in a round goes together with positive TC in that round. Similar effects can be seen for MMC and FNC, with FNC correlating most strongly. This all seems to be in line with the post made by MDO.
Yet for all of them the correlation is pretty weak. Models in the top 20% on CORR only perform around the top 40% on TC. Compare that to CORR versus MMC: the top 20% on CORR performs around the top 15% on MMC (see picture below).
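For concreteness, the in-round quantile comparison can be done roughly like this (same assumed columns as above):

```python
# Within each round, rank models into percentiles on CORR and on TC.
stable["corr_q"] = stable.groupby("round")["corr"].rank(pct=True)
stable["tc_q"] = stable.groupby("round")["tc"].rank(pct=True)

# Mean in-round TC percentile per CORR quintile: this shows where, e.g.,
# the top-20% CORR bucket tends to land on TC within the same round.
corr_buckets = pd.cut(stable["corr_q"], bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])
print(stable.groupby(corr_buckets)["tc_q"].mean())
```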
This all gets me pretty scared, especially if you keep in mind that this is not even the question we want answered. We aren't interested in the in-round correlation between the metrics, but in whether positive TC in one round means better TC performance in the next round. I have also made that analysis, and you can see the graphs about it underneath.
As we can see, the relation between TC in one round and TC in the next round is extremely weak. The slope does not increase with better performance; it is basically a flat line with some small ups and downs. We can also see that positive CORR in one round does not lead to positive TC in the next round. Positive MMC in one round works better: as MMC performance increases, so does next-round TC, albeit slightly. But the best measure to pay out on if you want to optimize future TC performance seems to be FNC.
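The cross-round version of the same check is basically a one-line shift; again a sketch on my assumed columns:

```python
# Pair each round's metrics with the same model's TC one round later
# (this assumes consecutive submissions; gaps would need a round check).
stable = stable.sort_values(["model", "round"])
stable["tc_next"] = stable.groupby("model")["tc"].shift(-1)

# Mean next-round TC conditioned on this round's percentile in each
# metric: a flat curve means the metric says nothing about future TC.
for metric in ["tc", "corr", "mmc", "fnc"]:
    quantile = stable.groupby("round")[metric].rank(pct=True)
    print(metric)
    print(stable.groupby(pd.cut(quantile, bins=10))["tc_next"].mean())
```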
So TC in one round does not seem really predictive of TC in the next; what does that mean? First of all, that the payout system will not benefit the hedge fund's performance. The TC reward is distributed in a way that does not improve future TC, so it is basically given away at random. If the goal is to increase future TC performance, rewarding FNC and MMC seems a lot more like the way to go.
What do these graphs mean for the participants? I am pretty disappointed to say it, but your models with positive TCs cannot be trusted very much to generate future positive TCs. For example, I have some models with TCs of 4%, but that information does not seem to mean anything. Every other metric is more stable and would be preferable for the participants.
How is this possible?
It might seem a bit weird that a metric is not the best predictor of its own future performance, especially since the other metrics are very predictive of their own futures. I think this is due to the huge noise generated in the optimization process, combined with the great reduction in the number of stocks that are traded on in the end. If you were to compute FNC on 50 randomly picked stocks, the resulting metric would show huge variance. If you wanted to predict which model is most likely to score high on this FNC_50 next week, you wouldn't pick the model performing best on FNC_50 this week; you would pick based on the normal FNC. Although TC might directly measure what we want, the huge noise in the measurement makes it a very undesirable payout metric.
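A toy simulation (not the actual FNC computation, just a mean score as a stand-in) illustrates how much variance the restriction to ~50 names adds:

```python
import numpy as np

rng = np.random.default_rng(42)
n_stocks, n_rounds = 5000, 10_000

# Toy model: per-stock scores with a small constant true edge plus noise.
scores = 0.02 + rng.standard_normal((n_rounds, n_stocks))

# Stand-in for "FNC": mean score over the full universe vs. over a
# random 50-stock subset (as if only 50 names end up being traded).
fnc_full = scores.mean(axis=1)
fnc_50 = scores[:, rng.choice(n_stocks, size=50, replace=False)].mean(axis=1)

print(f"std of full-universe metric: {fnc_full.std():.4f}")  # ~0.014
print(f"std of 50-stock metric:      {fnc_50.std():.4f}")    # ~0.14
```

Both versions estimate the same edge, but the 50-stock version is roughly 10x noisier, so you would need on the order of 100x as many rounds to separate skill from luck with it.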
Another field where you generally see this concept is poker. I used to play professionally, and one of the most important things to keep in mind when analyzing your game is the difference between outcome and expected outcome. A simplified example: imagine you go all-in with 50% to win and 50% to lose. Either way, the outcome will not reflect the true value of the all-in. Concluding that it was a bad all-in because you lost, or a great all-in because you won, is not correct, so mentally rewarding the all-in based on the outcome is not the best thing to do. What you want to look at is the expected outcome, which in this case is the expected win percentage.
Taking this example back to Numer.ai, the difference between outcome and expected outcome can also be huge here. The predictive power of the models, expressed in FNC (predictions are neutralized before being used) and MMC, seems to drive the true outcome, so these can be seen as the expected outcome. You then don't want to look at the realized outcome of using the predictions, the TC, because the variance added on top of the true drivers is huge. It is a lot more efficient to look at the true drivers: FNC and MMC (or better metrics yet to be found).
Other TC Problems
- The TC calculation seems complicated; what are the chances that bugs are created in this process?
- TC will reduce the ability of the participants to confirm the correctness of payouts.
- TC payouts make the generated signal dependent on the optimizer, locking the optimizer in. This optimizer is likely not optimal, but changing it would change the rewards of all players and disrupt the entire tournament, which seems undesirable. And the optimizer does seem clearly suboptimal: looking at the results of a round, you will generally see multiple people in the top 20 with negative TCs (see our most recent round for example). To me, the inability to use those signals looks like a flaw in the optimizer, not in the predictions.
- Due to TC's complexity, its unclear feedback loop, and the lack of real data about it, the expected improvement in participants' models will be low.
Possible better idea
Change the payout metric to something that is more verifiable, more consistent, and better correlated with future TC, such as FNCv3.
Code
Will be posted underneath this post.
Feedback
In general I have not seen a lot of recent discussion on the effects of using TC as the new MMC, or possibly as the main metric in the future. I would like to hear other people's thoughts/perspectives/analyses on this.