Relationship of daily round correlations to final round correlations

@profricecake If this is a concern, you should not be participating in the tournament. In contrast to the very long conversation in Rocket Chat yesterday (and many times before) about trusting third-party model staking, where there are big trust issues between users, this is about trusting the system.


That makes a lot of sense. But for the sake of transparency, revealing the score calculation method should not be a red line vis-à-vis the hedge fund operation. Transparency is the heart of DeFi, right :grinning:?

We don’t even know what the target is. So that could have a lot to do with it.

I did look at the data for ALL models and then filtered it in a few ways, but the basic pattern always holds: the first day's average change is much bigger than the others', and then it generally decreases. I think a lot of us have noticed that the first day is way off and will often reverse on the second or third day before the score gets even a semblance of a trajectory towards where it is going to end up. Probably down again to how the target is created. Once in a while, though, we've seen some radical changes in scores on the very last day or two…

I'm not reading into this discussion that anyone is questioning the integrity of the team. And I very much appreciate your insight and discovery, @wigglemuse. However, while you mention that the issue is pretty much settled, I cannot find any information in the official documentation regarding the details of the subject. I trust that you did ask and that Richard did respond, but it seems reasonable to want these details in a readily available point of reference as part of the official documentation, rather than having to scour the forums and chat rooms. Especially as a newcomer.

What @profricecake has proposed seems like an utterly reasonable possibility for a method of calculation. It may not, in fact, be what is happening, but I, for one, would love to see clear, easily accessible documentation from the official Numerai team, which doesn't seem like a big ask. If presenting that level of detail would pose some level of data leakage or security risk, then stating so would be sufficient. It's in all of our interests for each of us to have great performance, so I certainly don't think anything sneaky or untoward is going on. Probably just a squeeze on resources. I love the tournament and the community. I believe a push toward clearer and more available official documentation gives all of us some bedrock to stand on, and would allow our guessing to focus more on how we can improve predictions and less on the mechanics of the tournament.

My point is that they've only ever said it is one way, so there isn't really any more detail needed in order for it to be the plain truth. The need to clarify it only comes from users essentially asking "Is the way you say it is actually the way it is?" They obviously aren't going to document every esoteric detail that it is NOT. Things get documented above and beyond the basics when there are continuing questions about them, and that applies here. But the questions come before the documentation, because otherwise it is not known which things are points of confusion that need additional clarification. In any case, I am working on that exact documentation, and should be doing that right now instead of posting on the forum. So once that project is more or less complete (complete enough to post, anyway), if and when any more questions about esoteric details come up that I haven't answered, I will then be happy to add (with a quick response time) pretty much any reasonable question people have, and then it will be there to point to. And in fact, that's exactly why I made sure to clarify this once and for all directly with the team, because anything I put in those docs I will make sure is correct.

So the entire point of my documentation project is specifically so users (new users especially) do not have to scour all of these sources for all the questions they have; it will all be in one place. We started the project about a month too late: the flood of Lex Fridman podcast newbies filling the place with their questions arrived about the same day I started working on it. I wish it were ready already, but it will be quite soon.

Even Richard thinks the website should say it better. Here's the bit of the video I was referring to from the last fireside chat (51:06).

Guess it won't jump to the right place when embedded; if you open it up on YouTube and expand the "show more" section, you can see all the questions, including that one at 51:06.

Hi all.

I came to this thread in search of explanations. I’m not making accusations, I’m offering theories and challenging those that are either unsupported by evidence or that don’t seem to jibe with my readings of the data (or both).

Please try to remember that you Numerai veterans were here once too. Not everyone participating in the tournament has been exposed to the same information about it. Apparently some of you have been in direct contact with members of the staff; others, like me, have just started their involvement and know only what's on the website.

First of all: I’d love to watch it! It sounds like it has all the answers. But you still haven’t shared it yet.

Second: Hiding things is a core part of the Numerai concept. They strip all identifying materials from the data set to keep the competition more data science and less finance. I understand and accept this. They’ve documented it clearly on the website. But the daily scores are the opposite of hidden: they’re front and center for all the world to see, but they are only thinly documented (hence this thread).

Third: Conspiracy? Why did you bring that word up? Just because I'm not ready to accept your claims without evidence? Again, please try to remember that I haven't had the same access to the primary sources that you've had and, like you, just want to hear answers from a source I can trust. Just because you feel that you understand things with great certainty doesn't automatically make you a trustworthy source, since you are not a Numerai employee, nor have you (yet) offered any evidence beyond hearsay in support of your claims.

If you post that video that you keep referencing, I would greatly appreciate it. Then it will be available for anyone who has the same questions in the future and wants to hear it straight from Richard's mouth. This is less ideal than an update to the official Numerai docs, but still a big step in the right direction.

Looking ahead, I hope you plan to cite (and make available) your sources in your upcoming FAQ. This is historically how knowledge is built, and for good reason.

Thanks for sharing this. I didn’t know that there was a complete lack of feedback in an earlier incarnation of the tournament. Daily feedback would certainly be an improvement on that, for sure, even if the early days are not indicative of the final score.

This is pretty funny. The thing I’ve asked for most in my posts is a trusted source of information (like Richard) to chime in on this topic. Please don’t misinterpret my skepticism of unfounded claims made by Numerai outsiders as a lack of trust in Numerai’s leadership. There is no connection between the two, which is of course the source of my skepticism in the first place.

Aha! I see the video was posted while I was drafting my last message. I’m looking forward to watching it, @wigglemuse. Thank you for sharing it.

My FAQ may have a few links in it to existing materials, but it certainly isn't going to source and footnote every little thing, as that would be almost impossible. But no implementation details will be in there that I'm not sure about or haven't verified. And it is all public info – no insider access required, other than occasionally asking an insider about something (which they are only going to answer if it is OK to be public). But feel free to ask about any additional details, or if something remains confusing; that's the whole point.

The video answered some questions unambiguously for me so I thought I’d share what I’ve learned here so others might benefit.

  • Although daily scores are provided, Richard confirmed that "ultimately you get scored on the last day [of each round] only", and "there's nothing in the middle that could affect your performance." He did go on to acknowledge, though, that companies going bankrupt or some other large-scale market disruption could certainly change performance during a tournament round.

  • The daily scores, as Richard describes them in the video, are "just an estimate" of what your final return will be: the estimate Numerai would give if it were forced to produce one even many days out from the actual end of the round.

These statements resonate with the observations from this thread that it's easier to make a prediction when you're closer to the actual scoring day, which is why the daily scores seem to converge towards the final score (because, as many have noted, the market state on day 20 is likely to be more similar to the market state on day 19 than to that on day 1).

Based on this information, I’m confident that the idea I posted about daily scores being daily correlations is off-base. Here’s to learning!

In the wake of this informative thread, I would like to make three simple suggestions to Numerai to disambiguate this daily score stuff for others in the future.

  1. Call them “daily estimates of your final score” instead of “daily scores” (or something similar that would highlight the fact that they’re estimates)
  2. Add error bars to the website graphs based on what I'm sure is plenty of available data on the variance of the estimated scores relative to the final score. Bars would vanish for all completed rounds, of course, but the in-progress rounds would each have them, and they would grow progressively larger as we approach the latest round.
  3. Revise some of the documentation under Scoring on this page.

In an attempt to be helpful regarding #3, below are some suggested revisions to the existing docs.

Here’s a current (confusing) paragraph:

Each submission will be scored over the ~4 week duration of the round. Submissions will receive its first score starting on the Thursday after the Monday deadline and final score on Wednesday 4 weeks later for a total of 20 scores.

Here is a clearer and more informative version based on what I gleaned from the video:

Each submission will be scored on the final day of the ~4 week round. Submissions will receive estimates of their final score starting on the Thursday after the Monday deadline and continuing until the Wednesday 4 weeks later, when the final score will be released. These estimates are provided on what we call "scoring days" (weekdays M-F, minus market holidays). The estimates tend to become more accurate predictors of the final score as the tournament round draws to a close, but they are merely estimates. Only the score on the final day counts for the competition.

While I’m at it, I’ll offer a revision to the next paragraph too. Original:

Since a round takes ~4 weeks to resolve, if you submit new predictions every week, you will receive multiple (up to 4) overlapping scores on each scoring day from the 4 ongoing rounds.

Proposed revision:

Since a round takes ~4 weeks to resolve, if you submit new predictions every week, you will receive multiple (up to 4) overlapping score estimates on each scoring day from the 4 ongoing rounds.

Thanks to all who offered their input on this one!


Maybe I’m just beating a dead horse by bumping this… But here’s my take, code below.

The daily score is nothing more than the correlation between your prediction and the realized target on that day. The realized target is some form of cumulative return up to that point (it might be market-neutralized, scaled by risk, etc.).

So it can be fairly well modelled by 5000 (or however many stocks are traded) Brownian motions. Each day these are ordered, and that ordering is compared to your ordering.

A change in daily score is not triggered directly by the returns of stocks, but by the change in ordering caused by the change in cumulative returns. As @wigglemuse points out, the score will by definition converge to the final score.

Now to the fun part… The reason the absolute changes in increments (or the std, for those of us preferring L2 norms) decrease as the round progresses is that the average distance between the 1-D diffusion processes increases. Higher distance = lower chance of a change in the rank correlation between prediction and target.

The standard deviation of increments in daily scores goes down in magnitude by approximately 1/sqrt(day of round + 1).

Code for a simple Monte Carlo simulation of this:
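A minimal sketch of such a simulation, assuming a 5000-stock universe, 20 scoring days, and an arbitrary noise level on the prediction:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_stocks = 5000   # assumed universe size
n_days = 20       # scoring days per round
n_rounds = 100    # Monte Carlo trials

daily_scores = np.zeros((n_rounds, n_days))
for r in range(n_rounds):
    # Each stock's cumulative return is a Brownian motion:
    # the running sum of i.i.d. Gaussian daily increments.
    cum_returns = rng.standard_normal((n_days, n_stocks)).cumsum(axis=0)
    # A fixed prediction with some skill: the final cumulative return
    # plus noise (the noise scale of 3.0 is an arbitrary assumption).
    prediction = cum_returns[-1] + 3.0 * rng.standard_normal(n_stocks)
    for t in range(n_days):
        # Daily score = rank correlation between the fixed prediction
        # and the cumulative return realized up to day t.
        daily_scores[r, t] = spearmanr(prediction, cum_returns[t]).correlation

# The std of day-over-day score changes shrinks roughly like
# 1/sqrt(day + 1) as the diffusion paths spread apart.
print(np.round(np.diff(daily_scores, axis=1).std(axis=0), 4))
```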


Thanks @jrai this is really useful. :ok_hand:

Despite how little we know about early daily scores, our obsession with them will forever endure. Here's a quick update on this post, including Signals data and some better code in a Colab notebook so you can test your models, do comparisons, and run more/better analysis: Google Colab

The high level conclusions are still the same:

  • you can expect, on average, roughly a .02 - .03 correlation difference between a round's first-day score and its final resolved-day score (aka your actual score, which is the only one that matters), but it could range much higher than that.
  • scores only become somewhat informative around the 15th day of a round

The figures show a single model, with a different-colored transparent line for each round tracing the daily score's distance from that round's resolved score as the round progresses. Then we chart an average line (in red) and a band of +/- 1 standard deviation across all rounds. First, Signals:


And then Classic Tournament:

For these two models, across the tournaments and the same rounds (279-292), Signals and Classic daily distances are similar on average, but different rounds exhibited very different daily distances between the two tournaments (which is probably a good thing if you want to diversify risk by competing in both).
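For anyone who wants to replicate the distance curves outside the notebook, the core computation is small; here's a sketch, assuming you already have a rounds-by-days array of daily corr scores (the variable names are mine, not the notebook's):

```python
import numpy as np

def distance_curves(daily_scores: np.ndarray):
    """daily_scores: (n_rounds, n_days) daily corr scores, one row per round."""
    # Distance of each day's score from that round's resolved (final-day) score.
    dist = np.abs(daily_scores - daily_scores[:, -1:])
    mean = dist.mean(axis=0)   # the red average line
    band = dist.std(axis=0)    # half-width of the +/- 1 std band
    return dist, mean, band
```

By construction, every per-round line (and hence the mean and the band) hits zero on the final day.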

Comparing across some top Classic leaderboard positions, and pulling in more rounds so we can get a clearer picture of the mean (20d corr is only available for Signals starting at round 279), we can also see some more volatility (i.e. higher early daily score distances from the final score), which may be a common trait at the top of the leaderboard?

We can see the same looking at some top Signals leaderboard positions (onlyatest, for example, is just submitting ranked momentum predictions and is understandably very volatile):

Shoutout to @robo_boi for having some incredibly volatile Signals models; is there anything you're willing to share about why? My guess would be testing out single features? Perhaps not, because @arbitrage is also submitting a single feature to his model "leverage", and it has some of the lowest distances on average I've seen:

Questions to still answer:

  • Is there a relationship between rank and initial distance from final-day score (i.e. volatility)? What about between FNC rank and initial distance?
  • Same question that @bor1 asked: how reliable is the qualitative difference in daily score between two models submitted in the same week? (i.e. is model X, which looked better than model Y in week 1, indeed better than model Y at round resolution?)
    I did find that MMC tends to follow a slightly tighter path, so generally I think percentile ranks can hold a bit more steady through time, but it would be interesting to chart that out too.

My guess is that my factor is very similar to one of theirs, but sufficiently different that I can still register a non-zero corr. I'd further guess that if I were able to calculate my measure for the entire universe, my score would increase in volatility. Good analysis @jrai, these posts are always very interesting!

That makes sense, good point. Submitting a smaller universe almost certainly brings down daily score distance volatility (just because there are a lot more 0.5s diluting everything in the first place), and submitting a heavily neutralized factor would also probably decrease intra-round volatility. Conveniently, the next post is about neutralized corr vs unneutralized corr (UNC) in Signals. With UNC eventually in the API, we can go more in-depth and look at everyone's "neutralization effects." I think the range we see is going to be pretty interesting.

Cool analysis - thank you!

Would love to dig into this a little more deeply. Can you put any numbers to “somewhat informative”? How’d you pick day 15?

You're using absolute distance; are the scores N days prior to the final day equally likely to be below that final score as above it? (aka, is there a trend up or down as we converge to the final score?)

Thx

Good question; it's completely made up. Day 15 is generally where I saw the average absolute distance reach around 0.01, where the std also accelerated its shrinking to +/- .01. At that point, I felt like it deserved to be called "somewhat informative," but it was a pretty subjective landmark.

When I first did the analysis, the daily distances looked to be centered around 0, but now, with many more rounds included and a check across more models, it actually looks like very early daily score distances have a significant negative skew, especially in the extreme cases (i.e. see the standard deviation bands). In other words, scores may trend up, on average, as we converge to the final score. It's possible this negative skew is just from a few extreme rounds in this sample, but it also makes sense that predictions would get "better" as they get closer to what they're trying to predict in the first place.
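If you want to check that skew claim on your own models, here's a sketch (assuming the same rounds-by-days score array as above; it uses signed gaps rather than absolute distances, since skew is about direction):

```python
import numpy as np
from scipy.stats import skew

def early_score_bias(daily_scores: np.ndarray):
    """daily_scores: (n_rounds, n_days) daily corr scores, one row per round."""
    # Signed gap between each day's score and the resolved score;
    # negative means the daily score sat below the final one.
    gap = daily_scores - daily_scores[:, -1:]
    # Per-day mean tests the "scores trend up" reading; per-day skew
    # across rounds tests the negative-skew observation.
    return gap.mean(axis=0), skew(gap, axis=0)
```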


Thanks for the extra info!

BTW, are your 'scores' for your Classic tourney data pure corr values, or are they corr/MMC combos?

Since you clearly have the data loaded and available, I wonder if you might be willing to go further with this.

I suspect I'm not alone in hoping for positive scores each week, and in valuing any positive final score more than any negative final score. So in the wake of your latest post I'm curious:

  • If early scores are negative, what chance do I have that they’ll turn positive? And of course the opposite: if early scores are positive, what are the chances they’ll go negative? We could call this the “outcome reversal” probability. It is of course 0% on the final day of the round, but what is its value on day 15, 16, etc?
  • At what day during the round do most (say 95%) of the daily scores lie on the same side of zero as their final day counterparts? Aka, if my score is negative on day 15, what are the odds it will still be negative on the final day too? I guess this is just 1.0 minus the outcome reversal odds, so it’s really the same number.
  • Does the negative skew on early scores still exist if you separate the data into two chunks: final scores that were net positive (earns) and final scores that were net negative (burns)? Testing this would help support (or not) your intriguing suggestion that perhaps the negative skew is because our predictions get better as we get close to what we’re trying to predict.

Posted from wrong account.

This is pure corr. MMC tends to have less volatility, but is an interesting analysis on its own. Also, all good questions that hopefully I/we can answer soon. All of the code for this post is public in this notebook if you want to play around: Google Colab

Thx for the Colab link.

Here are two really simple plots (using data from your jrai model) where I estimate, for each scoring day, the probability of the final score being a complete reversal of the daily reported score. In other words, if on day 1 the score is positive, what are the odds that on day 20 the score will be negative? And so on throughout all 20 days.
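Roughly, the estimate is just a conditional sign-flip frequency; a minimal sketch (over the same assumed rounds-by-days score array, not necessarily the exact code behind the plots):

```python
import numpy as np

def reversal_probability(daily_scores: np.ndarray, final_positive: bool):
    """daily_scores: (n_rounds, n_days) daily corr scores, one row per round."""
    final = daily_scores[:, -1]
    # Condition on the sign of the resolved (final-day) score.
    subset = daily_scores[final > 0] if final_positive else daily_scores[final < 0]
    # For each scoring day, the fraction of rounds whose daily score
    # has the opposite sign of that round's final score (0% on the
    # final day by construction).
    return (np.sign(subset) != np.sign(subset[:, -1:])).mean(axis=0)
```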

Here is the plot for final scores that were positive:

[posswap plot]

And here is the plot for final scores that were negative:

[negswap plot]

They seem pretty similar on the whole. Here are my takeaway points:

  1. There is roughly a 35% chance that your final round score will have the opposite sign of your day 1 score (aka a day 1 burn has a 35% chance of becoming a day 20 earn, and vice versa)
  2. By approximately scoring day 11, you’re down to a 10% chance of a complete reversal.

I'd like to continue this analysis with the magnitude of the correlation in mind. Aka, I suspect that higher absolute corr values (either pos or neg) will result in less fate-swapping by the end.
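A sketch of that follow-up, bucketing rounds by the early score's absolute value and measuring the reversal rate per bucket (the bin edges here are arbitrary assumptions):

```python
import numpy as np

def reversal_rate_by_magnitude(daily_scores: np.ndarray, day: int = 0,
                               edges=(0.0, 0.01, 0.02, 0.05, 1.0)):
    """daily_scores: (n_rounds, n_days) daily corr scores, one row per round."""
    early, final = daily_scores[:, day], daily_scores[:, -1]
    flipped = np.sign(early) != np.sign(final)
    # Bucket rounds by the magnitude of the early score.
    bins = np.digitize(np.abs(early), edges) - 1
    # Hypothesis: larger early |corr| should mean fewer sign flips by day 20.
    return [flipped[bins == b].mean() if (bins == b).any() else np.nan
            for b in range(len(edges) - 1)]
```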
