A True Contribution backtest

Thanks @gammarat. GAMMARAT36 seems to be doing very nicely for TC, so well done. I don't think this analysis will change Numerai's mind; they've gone with it and I agree with you, we'll have to work with it. I have a couple of decent models with reasonable TC and I think they'll cut it. Maybe. As for improving TC for those models, the honest answer is I'm not confident, as I don't fully understand TC yet, and I'm trying to avoid a petulant "can't be arsed with it now" mindset; it may be beyond me and I'm midway through exam season. I'm a wannabe social/political scientist, not a data scientist, so I'll leave it for now and come back to it. I'm disappointed TC has dropped so quickly though, and I'm really disappointed at having to retire BillyWhizz and BearsBritches so soon. Though they've served me well, they just won't cut it under the new regime.

NB: My first full year is coming up soon… and I’ve made a point of architecting my models differently too.

2 Likes

You can always bet on CORR only as a transition. I think Numerai will be willing to change their mind if it obviously isn’t working, but they’ll have to see it not working first.

2 Likes

Yeah, I can, but the returns will be poor enough to make doing it nigh on pointless. Very poor, if the last 6 rounds are anything to go by. I have three models with decent CORR & TC scores I could use, but I haven't yet got the understanding to improve them; I have some ideas but not the time to put them to the test. Time will tell with TC. For sure, Numerai will do whatever is appropriate for them; I just would've liked a longer transition period.

1 Like

:laughing: I'm a retired applied math guy (12+ years now) so I've got lots of time to play… But right now my general architecture is based on a genetic algorithm determining parameters for a Gaussian mixture model approach, for individual models (models 1-25) and ganged models (26-50). It's been fascinating to watch populations of models rise, fall, and then disappear over thousands of generations while looking for a handful that might be robust… And I have yet to incorporate feature neutralization and the like. So I'll be at this for a while. Impoverished, but entertained :+1:

4 Likes

A post worthy of a retroactive bounty from the Council of Elders if I ever saw one! Excellent work.

In your analysis, where you looked at the round-to-round correlation of these metrics, did you exclude overlapping rounds? It might be worth dropping 3/4ths of the rounds to make it non-overlapping.

Have you tried extending the periods in your analysis? For example, instead of looking at someone's TC in round k and comparing it to their TC in round k+1, does it work to instead take someone's TC Reputation (20-round average of TC) and use it to predict their TC Reputation for the next 20 rounds? This might smooth the results (more hands of poker reveal the better players). Is TC Reputation still much less predictive than FNC Reputation? I'm guessing FNC Reputation still might win for the reasons you point out.
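
A minimal sketch of how that reputation-vs-reputation check could be set up, assuming a long-format dataframe with hypothetical columns `model`, `round`, and `tc` (not the actual API layout): it pairs each model's trailing 20-round TC average with its TC average over the following 20 rounds.

```python
import pandas as pd

def reputation_autocorrelation(df: pd.DataFrame, window: int = 20) -> float:
    """Correlate trailing TC reputation with the reputation over the next `window` rounds.

    Assumes one row per (model, round) with columns: model, round, tc,
    and that rounds are consecutive integers for each model.
    """
    df = df.sort_values(["model", "round"]).copy()

    # Trailing reputation: mean TC over the `window` rounds ending at `round`.
    df["rep"] = df.groupby("model")["tc"].transform(lambda s: s.rolling(window).mean())

    past = df[["model", "round", "rep"]].rename(columns={"rep": "rep_past"})
    future = df[["model", "round", "rep"]].rename(columns={"rep": "rep_future"})
    # The reputation ending `window` rounds later covers exactly the next `window` rounds.
    future["round"] -= window

    merged = past.merge(future, on=["model", "round"]).dropna()
    return merged["rep_past"].corr(merged["rep_future"], method="spearman")
```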

We ran some simulations where we simulated user payouts assuming everyone was staking on CORR + 1x TC, and backtests on this evolving meta model do make more return than paying on CORR alone. These simulations of course don’t show the whole picture because they can’t simulate the changes users would have made to their models or stakes under this TC feedback.

5 Likes

I feel like this definitely fits in with this proposal under discussion: [Proposal] Bounty for high quality data science posts - #5 by aventurine

5 Likes

Thank you for the nice words, Richard. I have indeed been excluding overlapping rounds, so I would be comparing, for example (if they were finished), round 304 with round 309.

Looking both 20 rounds forward and 20 rounds back would run into the problem of wanting to use data with no big regime change in it. I suspect, for example, that the historical performance of models mattered less once the 10x data came into play. What I think is especially missing is looking at more rounds afterwards, since the payout for one round affects not just the one round after it but all rounds after it. I will make an extension to this when I have time, hopefully tonight.

Ah, good to hear about your backtests. My CORR vs next TC graph also seems to indicate that paying for CORR in order to optimize future TC (which should equate to higher returns) is not the best option. Maybe comparing one payout (CORR) with two payouts (CORR + 1x TC) is not the cleanest way to compare results. Maybe it makes more sense to compare CORR + 1x TC with CORR + 1x MMC?

1 Like

@bvmcheckking Thanks for the analysis! We did indeed compare backtest simulations using lots of permutations of payout systems including both CORR + 1x MMC and CORR + 2x MMC. The best ones were either TC alone or in addition to CORR.

There are a few reasons we would rather not use FNC as the payout metric. We don't want to completely disincentivize any feature exposure. Some amount of feature exposure and/or feature timing modeling can be beneficial, and thus TC can incentivize a much wider range of possible models. Furthermore, TC helps reward originality in a way that FNC does not.

I’ve also done some analyses that are complementary to what you’ve done, but don’t seem quite as dire and add a bit of color to the situation.
These are histograms of the correlations of all users' scores at a given era with their scores at the next non-overlapping future era.
[image: histograms of per-era correlations between current scores and TC at the next non-overlapping era]
The above plot suggests to me that both FNC and TC are decent proxies for future TC, but the relationship is of course noisy.

[image: histogram of per-era correlations of FNC with future FNC]
I find the above plot quite interesting. It agrees with your analysis that, on average, FNC decently predicts future FNC. But there are a large number of eras where the relationship is strongly negative, < -0.1, many more than in the above plots involving TC.
So, all in all, TC still seems like a good payout metric to me and targeting good FNC scores still seems like a good way to get there.
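
For anyone who wants to reproduce this kind of histogram on their own data pull, here is a minimal sketch under assumed column names (`model`, `round`, `tc`, plus one column per metric; adjust to your own layout). It computes, for each round, the cross-sectional Spearman correlation between a chosen metric and TC a few rounds later, and plots the distribution; the 5-round spacing mirrors the 304-vs-309 comparison mentioned earlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

def era_correlation_histogram(df: pd.DataFrame, metric: str, gap: int = 5) -> pd.Series:
    """Per round: correlate all models' `metric` with their TC `gap` rounds later.

    Assumes one row per (model, round) with columns: model, round, tc, <metric>.
    """
    now = df[["model", "round", metric]]
    later = df[["model", "round", "tc"]].rename(columns={"tc": "future_tc"})
    later["round"] -= gap  # align round r's metric with TC at round r + gap

    pairs = now.merge(later, on=["model", "round"]).dropna()
    per_round = pairs.groupby("round").apply(
        lambda g: g[metric].corr(g["future_tc"], method="spearman")
    )

    per_round.plot(kind="hist", bins=20, title=f"{metric} vs TC {gap} rounds later")
    plt.show()
    return per_round
```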

Also, FWIW the TC code is actually fairly simple, it’s all in the original post.

Thanks again for the thoughtful analysis and further discussion is of course most welcome!

3 Likes

Can you tell me roughly how much time, in hours, it took you to code this and put this post together? It could spark more discussion and a possible approval of the Bounty for high quality data science posts if the rest of the community wants to see more community members doing this type of work.

5 Likes

Goals extension

I tried thinking a bit more about the reasons for increasing the number of last/past rounds used for determining the top quantiles, or for increasing the number of next/future rounds. I think the following makes sense:

For the backtest, we should take into account that we get paid for our results in individual rounds, not over multiple rounds, but that these payouts do affect all future rounds. So for the backtest we are interested in adding more future rounds, but not necessarily in increasing the number of past rounds.

As participants, we are interested in how we can most accurately estimate our future performance. We would definitely want to add extra past rounds to see if that predicts the future more accurately. For adding future rounds, I think it depends a bit on your goal. Personally I update my staking ensemble on a weekly basis, so I am interested in how well a multitude of past rounds predicts the next round. Other participants might prefer to re-decide their staking models less frequently. Those participants are more interested in using the past results of multiple rounds to predict the future results of multiple rounds.

So keeping this in mind, I am going to show graphs of these 3 situations. I have, relatively arbitrarily, set the number of multiple rounds at 5: the computational time required to create these graphs kept me from checking too many options, I did not want to pick a number so big that it would reduce the number of comparisons I can make, and I wanted to have at least one completely non-overlapping round.
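
To make the three set-ups concrete, here is a minimal sketch of how the past/future windows could be paired, under the same assumed per-round column layout as before (`model`, `round`, `tc`, plus a column per metric); the actual code posted earlier in the thread may do this differently.

```python
import pandas as pd

def past_vs_future(df: pd.DataFrame, metric: str, n_past: int, n_future: int, gap: int = 4):
    """Pair each model's average `metric` over the last `n_past` rounds with its
    average TC over the `n_future` rounds that start after a `gap`-round delay.

    The 4-round gap is an assumption mirroring the non-overlap spacing used in
    this thread (a past window ending at round r is paired with rounds r+5 onward).
    """
    df = df.sort_values(["model", "round"]).copy()
    grouped = df.groupby("model")

    # Backward-looking average of the chosen metric.
    df["past_avg"] = grouped[metric].transform(lambda s: s.rolling(n_past).mean())

    # Forward-looking average of TC, skipping `gap` rounds after the past window.
    df["future_avg"] = grouped["tc"].transform(
        lambda s: s[::-1].rolling(n_future).mean()[::-1].shift(-(gap + 1))
    )
    return df.dropna(subset=["past_avg", "future_avg"])

# The three situations discussed above (df being the per-round scores dataframe):
# backtest   = past_vs_future(df, "tc", n_past=1, n_future=5)   # last 1 vs next 5
# frequent   = past_vs_future(df, "tc", n_past=5, n_future=1)   # last 5 vs next 1
# infrequent = past_vs_future(df, "tc", n_past=5, n_future=5)   # last 5 vs next 5
```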

Graphs

‘Backtest’ / last vs next 5 TC

Here we can again see that FNC seems most predictive; CORR is doing pretty well now as well, MMC not that well. TC seems to be doing fine up to the top 20%-5% (I cut off the top 4%-1% for all graphs due to instability of results caused by too few users in that range). The weaker performance of TC in the top quantile is not great, especially because TC seems to be a metric that pays out a lot more top-heavily than the other metrics.

‘Participant evaluating, frequent rebalancing’/ last 5 vs next TC

Now FNC and MMC seem to be most predictive, followed by TC and then CORR. So when making your weekly rebalancing decisions in the coming TC period, it might make sense to look at a combination of your past FNC and MMC results.

‘Participant evaluating, infrequent rebalancing’/ last 5 vs next 5 TC

… This does not seem very much in line with previous results: CORR and MMC perform exceptionally badly, with MMC also going (slightly) off the chart and performing well at the top.

Explanation

Unfortunately, I don't really have one, so I am hoping somebody else is able to make sense of it. The prime contender for the most likely cause would, in my mind, be a bug, but I don't think that is the case. I have done some extra testing and also created the graph of each metric against itself, and to me these non-TC graphs seem very plausible:

Details

The quantiles for metrics over multiple rounds are defined by looking at the best average performance by quantile (e.g. a good score for the MMC quantile would be 95%), not by the metric itself (e.g. a good score for MMC would be 0.04). An argument could also be made for defining quantiles by the average metric score, but with the way my code was set up it was easiest to extend it in this manner.
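
To make the distinction explicit, here is a minimal sketch of the two possible definitions, again under the assumed `model` / `round` / metric column layout; the graphs above use the first (average of per-round percentiles), not the second (percentile of the averaged metric).

```python
import pandas as pd

def multi_round_quantiles(df: pd.DataFrame, metric: str, rounds: list) -> pd.DataFrame:
    """Two ways to assign a model a single quantile over several rounds.

    Assumes one row per (model, round) with columns: model, round, <metric>.
    """
    sub = df[df["round"].isin(rounds)].copy()

    # Per-round percentile of the metric (0 = worst, 1 = best within that round).
    sub["pct"] = sub.groupby("round")[metric].rank(pct=True)

    per_model = sub.groupby("model").agg(
        avg_pct=("pct", "mean"),       # average of per-round percentiles (used above)
        avg_metric=(metric, "mean"),   # average of the raw metric (the alternative)
    )
    per_model["quantile_by_pct"] = per_model["avg_pct"].rank(pct=True)
    per_model["quantile_by_metric"] = per_model["avg_metric"].rank(pct=True)
    return per_model
```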

Code update

Because this potential bug is bothering me a bit, I will refactor the code a bit and then edit the previous code (the 2nd and 3rd comments). If I do find a bug I will notify you guys.

1 Like

@aventurine, thank you for lobbying for this/me :slight_smile: . I would have liked to respond earlier, but I felt I had to first create the reply to Richard that I promised, and that took a bit longer due to overestimating the time I would have and underestimating the time it would take to create that reply :wink: . Anyway, creating the initial forum post + code took me about 16 hours.

@mdo very nice to see some of the results, thank you. Looking at your results, I also seem to see that FNC → TC has a slightly higher average correlation, but also a higher variance per round. I wonder if this is the same in the rounds I looked at (I think your analysis takes all rounds going back a few years?). Maybe this higher variance could be a potential cause of the confusing last 5 vs next 5 graph I produced. I will need to think about / delve into that one a bit more.

This looks like a good result for TC to me.

The idea is that, yes, TC can be noisy in single rounds, but taking the average TC score over the last 5 non-overlapping rounds (or longer) can give a good sense of how good a model's subsequent TC will be.

Based on your results, it seems like a model with high TC on a TC-ranked leaderboard is more likely to stay high than a high-CORR model on a CORR leaderboard. I think this is a really good thing.

1 Like

The volatility drag associated with the weekly TC noise is a serious concern for staking and should not be smoothed away for analysis, as that smoothed alternative scenario will not materialize in realized returns.
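
To illustrate the point with made-up numbers: two payout streams with the same average weekly return compound very differently once week-to-week noise is added, so analyzing a smoothed series overstates what a staker would actually realize. A minimal, self-contained simulation (all figures are illustrative assumptions, not tournament data):

```python
import numpy as np

rng = np.random.default_rng(0)

mean_weekly = 0.01                                       # assumed 1% average weekly payout
noisy = mean_weekly + rng.normal(0.0, 0.10, size=52)     # with heavy week-to-week noise
smooth = np.full(52, mean_weekly)                        # the smoothed alternative

print("smooth, compounded over a year:", np.prod(1 + smooth) - 1)
print("noisy,  compounded over a year:", np.prod(1 + noisy) - 1)

# In expectation the noisy stream loses roughly sigma^2 / 2 per week of
# compounded return (about 0.5% per week here) relative to the smooth one,
# even though both have the same arithmetic average -- the volatility drag.
```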

6 Likes

The CoE will get together in person before Numercon, go over this, and then possibly get the Bounty for High Quality Data Science Posts passed once we frame it the right way. Possibly we can do a retro bounty for this, and in the future, once the Bounty for High Quality Data Science Posts passes, community members would write up short summaries of what they want to work on for a project and the time required, and we can do approvals before articles are written, if that makes sense.

I understand the key point you are debating is the randomness and volatility of TC (or not).

But just a couple of comments on the “other problems”:

Could you elaborate on this? I’m not sure how payout correctness can be confirmed under any of the metrics.

There are some negative TCs up there, but many of them appear to be duplicate submissions. The number might drop significantly if you delete the duplicates. (Maybe this has an effect on the analysis?)

I’m not sure negative TC (or near zero, even more so) means the optimiser is unable to use the signal. If the signal is unique, then yes, the optimiser doesn’t want to use it. But if the signal is common, then it means it is already used optimally, or too much, and using it more won’t help.

> Could you elaborate on this? I'm not sure how payout correctness can be confirmed under any of the metrics.

With this I actually meant the correctness of the scores/metrics. It would still be pretty hard, but if a few participants who each use 50 models were to collaborate, they could use the predictions of their hundreds of models to check whether any possible realized results of the stocks we predict would be consistent with the correlation scores they got in the end. With TC, something like this seems impossible to me.

> There are some negative TCs up there, but many of them appear to be duplicate submissions. The number might drop significantly if you delete the duplicates. (Maybe this has an effect on the analysis?)

About the duplicates: I don't think removing them changes much, because I think having any negative TCs up there at all is already very weird. Randomly eyeballing a few rounds, I came upon round 295, where I count 6 unique negative TC scores in the top 20. Another thing that is pretty surprising is looking at kwak_09 and kwak_10 in round 295.


The names, the CORR, the MMC, the FNC, the FNCv3 and the corr-with-meta-model scores all seem to indicate that these models are very similar. But the TC percentile scores are 29 and 78, going from -0.0159 to 0.0326.

> I'm not sure negative TC (or near zero, even more so) means the optimiser is unable to use the signal. If the signal is unique, then yes, the optimiser doesn't want to use it. But if the signal is common, then it means it is already used optimally, or too much, and using it more won't help.

Many models in the top 20 don't have real stakes on them either. For example, looking at the top 20 of round 295 again, you can see that none of these models have a stake higher than 0.021, so they are basically unused. So these signals are unused, and for some reason TC implies they won't help either. I also went through them, and all of them were very unique as well: corr with the meta model was below 0.5 for all of them, with many around 0.2.

And this is all just very weird to me. Models whose individual predictions probably strongly outperformed the meta model, and which are also very unique signals, are seen as not contributing to the meta model. These models, as indicated by MMC, are extremely valuable for improving the meta model's performance. Normally I would think this is exactly what a hedge fund looks for, but apparently, according to Numerai's new metric for judging these models, they would rather be a drag on the ecosystem than a great asset.

3 Likes

I agree that long-term stability is a good thing from the participant's perspective. For the backtest / the hedge fund's performance I think this matters less, since payouts are for weekly results, not for the average of multiple rounds.

Although TC seems more stable than CORR on a 5-round basis, we are currently replacing MMC with TC, not CORR with TC, so I am not sure how important this is. MMC seems slightly more stable on this measure as well. If this type of stability is a big concern, then these graphs indicate that FNC is a very interesting choice, because it blows the other metrics out of the water on this aspect.

Also, I would like to bring the thing I pointed out in the other post to your attention as well: I randomly stumbled on two models that perform very similarly in a round on every other metric, but very differently under TC. Specifically, kwak_09 and kwak_10 in round 295.

That might happen if the model, while correlated to the market, was quite different from the meta model, and if one were to trace a path from one to the other, there might be local minima. If that were the case, then the slight perturbation used to determine TC could give a negative result, while a large perturbation might be positive.

As for your idea for those of us with large numbers of models collaborating, I would be happy to contribute. Right now I’m playing with genetic algorithms and looking at resultant (Spearman) correlations between my submissions. In the Tournament they are performing abysmally, but they are giving me some understanding of TC.
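
In case it is useful to anyone doing the same exercise, a minimal sketch of computing pairwise Spearman correlations between a folder of submission files; the path and the `id` / `prediction` column names are assumptions about a typical submission layout, not a prescribed format.

```python
import glob
import pandas as pd

# Load each submission into one wide frame: one column per model, indexed by id.
# Assumes CSV files with `id` and `prediction` columns; adjust to your own layout.
preds = {}
for path in sorted(glob.glob("submissions/*.csv")):
    sub = pd.read_csv(path)
    preds[path] = sub.set_index("id")["prediction"]

wide = pd.DataFrame(preds)

# Pairwise Spearman correlation between all submissions.
pairwise = wide.corr(method="spearman")
print(pairwise.round(3))
```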

1 Like

Hey all. I updated the code because I feared I had created a bug earlier, due to some results that surprised me. Having updated the code, I don't think the previous version contained any bug.

It also made me realize how few rounds I used to create some of the graphs. Originally I wanted to use only data from after the latest change, the 10x data, which only allowed me to have rounds 285-304 in my analysis. Now, if you want to take the last 5 rounds, skip 4 rounds (because that is the first round where those 5 past rounds are available), and then take the average of the next 5 rounds, you can basically only do this for 6 rounds. (The first round you can start with is where you take 285-289 as the past performance rounds, and the last is where you take 291-295 as past performance and thus 300-304 as future performance.)

Because of this, I decided to also recreate the graphs using the data from rounds 251-304, allowing me to use 41 rounds of data. This should reduce the variance of these graphs by a lot, but it might also introduce bias, both because old rounds are less relevant in general (e.g. not having the 10x data) and because I feel it is unfair to compare percentiles across rounds at the moments when a new change was introduced.
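
For concreteness, a small sketch of the window bookkeeping described above, assuming a 4-round resolution gap between the past and future windows (matching the 291-295 past / 300-304 future example); for rounds 251-304 it enumerates the 41 usable comparisons.

```python
def comparison_windows(first_round, last_round, n_past=5, n_future=5, gap=4):
    """Enumerate (past_rounds, future_rounds) pairs that fit inside the data range.

    `gap` is the number of still-resolving rounds skipped between the two windows
    (assumed to be 4 here, as in the 291-295 past / 300-304 future example above).
    """
    windows = []
    start = first_round
    while start + n_past + gap + n_future - 1 <= last_round:
        past = list(range(start, start + n_past))
        future_start = start + n_past + gap
        future = list(range(future_start, future_start + n_future))
        windows.append((past, future))
        start += 1
    return windows

windows = comparison_windows(251, 304)
print(len(windows))   # 41 comparisons for rounds 251-304
print(windows[-1])    # last pair: past 291-295, future 300-304
```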

‘Backtest’ / last vs next 5 TC

‘Participant evaluating, frequent rebalancing’/ last 5 vs next TC

‘Participant evaluating, infrequent rebalancing’/ last 5 vs next 5 TC

Adding this extra data shows, in general, that over the last year both CORR and MMC have not performed well in predicting future TC. TC's performance has increased relative to the recent-data-only graphs. TC and FNC seem to have performed at a similar level, with TC performing slightly better. Maybe FNCv3 would outperform TC.

Metric vs own metric, last 5 vs next 5

This shows that over the last year TC was the least stable of the four metrics.

Conclusion

Taking the whole last year, it seems that TC is the best metric to optimize for future TC when looking 5 rounds ahead and taking 5 rounds of past performance. FNC seems to perform at a similar level, but slightly worse. CORR and MMC don't seem to work very well.

Past performance on TC seems the least predictive of future performance on its own metric, so TC can be seen as the least stable of the four. One can therefore expect bigger variance in this metric than you were used to in the past.

I personally also expect somewhat higher variance than normal in TC at the start of the implementation on the 9th of April, because people will be picking different models to stake on due to the new payout structure.

2 Likes

Please DM me a wallet address; the retro bounty will be queued up. For the Bounty for High Quality Data Science Posts we are working out the details, but we might ask for your help in the next couple of weeks to transfer this over to another section of the forums or to put it in an article format for sending out in the newsletter; we will get back to you. Thanks for the detailed post! We want to see more of these types of posts soon!

4 Likes