Relationship of daily round correlations to final round correlations

Is the relationship between the daily round correlations and the final submission correlations documented anywhere? I’ve noticed that the last daily correlation of a submission always seems to match the submission’s overall correlation. This makes me curious about what we are actually watching in the charts for individual rounds.

Related question, and part of why I was looking at the daily round correlations: why don’t recent rounds have correlations in the API anymore?

The first part of your question is unclear to me, but it sounds like you may have a misunderstanding of what the scores mean. In any case, can you re-phrase or give an example so we are sure what you are talking about?

And as far as the API goes, again I’m not sure what you are referring to. (I pull the scores from the API every day with no problem. We are talking about the Numerai tournament, not Signals, right?)

Re: the first question, look at these screenshots from https://numer.ai/integration_test . The first shows the current correlations of the last 8 rounds. The next 8 switch to the round view, and the most recent date always matches the correlation from the first screenshot. I’ve noticed this on my models and a few others that I spot-checked, so it doesn’t seem to be an accident that the most recent date and the submission correlation match.

(Screenshots of the round view for rounds 237 through 230 omitted.)

For the second question, this code in Colab outputs a list of round/correlation pairs, but the correlations are all None from round 185 onward.

import numerapi

napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
sorted([(r["roundNumber"], r["submission"]["liveCorrelation"]) for r in napi.get_user_activities("integration_test")])

returns

[(168, -0.033684549296499416),
 (169, -0.05988718140443864),
 (170, -0.0553602620696002),
 (171, -0.049266679898113),
 (172, -0.03630719909101644),
 (173, 0.025985500504023682),
 (174, 0.019309163665860288),
 (175, 0.017878789613636287),
 (176, 0.024122615017301122),
 (177, -0.010733890025773116),
 (178, -0.04190927753899791),
 (179, 0.0073781890375614724),
 (180, 0.0015480821870788558),
 (181, 0.008458966549655554),
 (182, 0.0553799401902486),
 (183, 0.04973214110776211),
 (184, 0.053386781798264636),
 (185, None),
 (186, None),
 (187, None),
 (188, None),
 (189, None),
 (190, None),
 (191, None),
 (192, None),
 (193, None),
 (194, None),
 (195, None),
 (196, None),
 (197, None),
 (198, None),
 (199, None),
 (200, None),
 (201, None),
 (202, None),
 (203, None),
 (204, None),
 (205, None),
 (206, None),
 (207, None),
 (208, None),
 (209, None),
 (210, None),
 (211, None),
 (212, None),
 (213, None),
 (214, None),
 (215, None),
 (216, None),
 (217, None),
 (218, None),
 (219, None),
 (220, None),
 (221, None),
 (222, None),
 (223, None),
 (224, None),
 (225, None),
 (226, None),
 (227, None),
 (228, None),
 (229, None),
 (230, None),
 (231, None),
 (232, None),
 (233, None),
 (234, None),
 (235, None),
 (236, None),
 (237, None)]

I can’t answer your API question exactly, but that must be a deprecated field you’re pulling. (I get correlations from v2RoundDetails, but I don’t use numerAPI – you might want to ask in the Rocket.Chat “api” channel.)

As to the first question, it does seem like you’ve got a fundamental misunderstanding there, because of course they match – the most recent daily score and what you are calling the “submission’s correlation” are the same thing.

The daily scores you see in the “round” dropdown for each round are snapshots in time of that day. They are not cumulative or anything like that. Each day your predictions are compared to the live results as they stand on that day (which is actually a lag of 2 days from the live stock market, but that’s another complication). Only the last day of the round, when it “resolves” (i.e. the 20th score, after 4 weeks), actually means anything – that’s the day you are scored on for payment. All the other days are just something to look at in the meantime. So for each round you have 19 scores and “payouts” that tell you WHAT YOU WOULD HAVE GOTTEN FOR THAT ROUND IF IT HAD ENDED ON THAT DAY. Of course it didn’t end on any of those days, so again they are just something to follow along with as the round progresses. Only the final 20th day means anything.

And that explains why the most recent score is always what’s listed on the “submission” page – the submission page is simply a summary of your final scores (for all rounds except the most recent 4) plus the in-progress scores of the 4 most recent rounds (which are open, not resolved – except on Wednesdays, the last day of a round, when the 4th round back gets its final score).

SO… exactly one score per week is actually final (i.e. the only one that counts for real payment or burn). This week it was yesterday’s (Nov 11) score for round 233, as that is the round that finished/resolved this week. And today (Nov 12) we got a score for round 237 for the first time, so rounds 234–237 are currently open and still in progress.
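As a sanity check on the “snapshot, not cumulative” semantics, here is a toy sketch. It is my own illustration, not Numerai’s actual pipeline: the scoring function and the random “market snapshots” are stand-ins. Each day’s score is just the correlation of one fixed submission against that day’s snapshot, so the final score depends on the 20th snapshot alone.

```python
import numpy as np

def daily_snapshot_score(predictions, live_targets):
    # Rank-transform the predictions and correlate them with one day's
    # live targets. (Hypothetical helper: Numerai's exact scoring code
    # isn't public; the point is only that each daily score is computed
    # from that single day's market snapshot.)
    ranks = np.argsort(np.argsort(predictions))
    return np.corrcoef(ranks, live_targets)[0, 1]

rng = np.random.default_rng(0)
preds = rng.random(500)  # one fixed submission

# Twenty independent "market snapshots" -> twenty independent daily scores.
daily_scores = [daily_snapshot_score(preds, rng.random(500)) for _ in range(20)]

# Only the 20th snapshot determines the payout; changing the first 19
# snapshots would leave it untouched.
final_score = daily_scores[-1]
```
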

Starting to make sense?

This is exactly why I asked if there was documentation!

This was actually my understanding before posting, but I wanted official (or veteran) confirmation of it. Just looking at the round charts, it is very easy to assume that those are the correlations of that round on that day’s data, because the label is “correlation” – not “correlation update”, “correlation so far”, “cumulative correlation”, or “correlation snapshot”. I looked at those charts for months before realizing I was reading them wrong.

The closest thing I found to a definition in the official docs is “Each submission will receive daily updated scores starting from the first Thursday after the submission deadline to the Wednesday 4 weeks after” and “But only your final score and final payout will count.” The latter rules out the naive interpretation of the official chart labels, but it does not say how the intermediate updates are calculated. Two obvious candidates are averaging over the resolved days only, or assuming zeros for the unresolved days. The former seems a little more intuitive to me.
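For concreteness, here is a sketch of the two candidate calculations, using made-up per-day correlations (the numbers are arbitrary, just for illustration):

```python
# Hypothetical per-day correlations for the first 5 of a round's 20 scoring days.
day_corrs = [0.02, -0.01, 0.03, 0.00, 0.01]
n_total_days = 20

# Candidate 1: average over the resolved days only.
over_resolved = sum(day_corrs) / len(day_corrs)   # 0.05 / 5  = 0.010

# Candidate 2: treat the 15 unresolved days as zeros.
assume_zeros = sum(day_corrs) / n_total_days      # 0.05 / 20 = 0.0025
```

The two candidates differ only by the denominator, so they agree on the final day and diverge most at the start of the round.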

So just to clarify – that’s exactly what it is: the correlation of your predictions to the state of the market on that day (with a 2-day lag from the real-life market – Wednesday scores are from Monday’s market, etc.). It is just that if it isn’t the final day of the round, that day doesn’t mean anything. When figuring final scores, your intermediate daily scores don’t enter into the calculation whatsoever – again, they are just there to look at, nothing else. It used to be that we got a single score for each round after waiting a month, with zero idea how the round was shaping up. That made people crazy with anticipation, and now we’ve got something to pass the days. But still, the only score that matters each week is the final score on Wednesday for the 4th most recent round.

Ah, that’s where I got confused. The daily labels tricked me into thinking some of the round was resolving early (i.e. that not everything was predicting four weeks out). The updates are more like “pretend we resolved the same bets early”. I guess some predictions could be less than four weeks long, but that’s not a helpful way to think about it now. Thanks!

Yes, correct. I’m not sure we’ve ever gotten a definitive answer on whether anything is actually resolved before the final date, but it doesn’t seem like it. “Pretend it is all resolved” each day before the final day is exactly right – it is just pretend until the final day. Usually your scores track, and what you are getting in the final week is pretty close to what you’ll end up with, but not always. Just this week there was a big change for many on the final day, as it corresponded with a huge market shift on Monday (the vaccine announcement, I think). So there is a component of luck there too…

I don’t believe daily scores are indicative of your final resolved score until at least the 15th day (3rd week) of each 20-day (4-week) round. In fact, I think we put way too much weight on daily scores. Here’s my unscientific analysis to answer the question:

This chart shows a different line for every round. The y-axis shows how far each day’s score is from the final score your model gets for that round. The x-axis shows which day of the round you’re on. On the final day of the round, each round’s line converges to 0, because that is your final score! The dashed line is the average distance for each day over all rounds. Although the average distance of daily scores from final scores looks to be 0, that’s only because it’s essentially random whether my daily scores are higher or lower than my final score will be.

What’s more important is the absolute value of the distance from your final-day score, which looks like the chart below. Clearly, it’s downward sloping. What this tells me is that on the first day of every round, my daily score will be ~0.03 correlation points away from my final score. It’s not until roughly the 15th day that my daily scores are within 0.01 correlation points of my final score. In some cases, even on the 15th day, my daily score can be as much as 0.04 points away from my final score.

And here’s the code if you’d like to check your own models’ “consistency.” I’ve found that most models exhibit the same behavior, though there is likely something interesting to be found in how different models’ daily scores evolve. The same analysis can be done for MMC by changing all references to “correlation” to “mmc”:

import numerapi
import pandas as pd
import matplotlib.pyplot as plt

napi = numerapi.NumerAPI()
df = pd.DataFrame(napi.daily_submissions_performances("jrai")).set_index("date")
df = df[df["roundNumber"] < 233]  # keep only fully resolved rounds

df["distance"] = (
    df["correlation"] - df.groupby("roundNumber")["correlation"].transform("last")
).values

# re-index each round by day-of-round (the new "level_1" column)
df = (
    df.groupby("roundNumber")
    .apply(lambda x: x.reset_index(drop=True))
    .drop("roundNumber", axis=1)
    .reset_index()
)

# plot distances
df.set_index("level_1").pivot(columns="roundNumber", values="distance").plot(
    figsize=(10, 5), title="Daily Scoring Distance from Final Day Score"
)

df.groupby("level_1").mean().distance.plot(style="k--")

plt.xlim(0, 20)
plt.legend(bbox_to_anchor=(1.4, 1), loc="upper right", ncol=3)
plt.ylabel("Distance from Final Day Score")
plt.xlabel("Days into Round")
plt.figure()

# plot absolute distances
df.abs().set_index("level_1").pivot(columns="roundNumber", values="distance").plot(
    figsize=(10, 5), title="Daily Scoring Absolute Distance from Final Day Score"
)

df.abs().groupby("level_1").median().distance.plot(style="k--", legend=None)

plt.xlim(0, 20)
plt.ylabel("Distance from Final Day Score")
plt.legend(bbox_to_anchor=(2, 1), loc="upper right", ncol=3)

plt.xlabel("Days into Round")
plt.figure()

Edit: I just realized that the absolute-distance graph is actually showing the median distance, which might be a better measure than the mean anyway. I was switching between mean and median and forgot to switch back. Change any reference between “.median()” and “.mean()” to see the differences.

Thanks @jrai, great stuff! A related question to the usefulness of the daily scores you might know the answer to:

How reliable is the qualitative difference in daily score between two models submitted in the same week? (I.e., is a model X that looked better than model Y in week 1 indeed better than model Y at round resolution?) Do you happen to have an analysis or intuition on that as well?

The above question makes sense under the assumption that models X and Y are built in compatible ways and with compatible aims (basically hyperparameter tuning), i.e. not a P vs. 1−P pair of models. :-)

Hi all -

As a relative Numerai newcomer I’ve been trying to understand the daily scores as well.
Based on the limited docs, the posts in this thread, and my own explorations, I have another interpretation that I thought I’d share here for discussion.

Note I’ve only really considered the daily corr score in this post, but I suspect that mmc follows similar logic.

To my mind, the easiest way to explain why the daily score converges on the final score (as per the plot from @jrai) is that the scores are indeed cumulative. In other words, the daily corr score for day N is the daily corr between each day’s predictions and that same day’s actual targets, averaged across the first N days.

This interpretation goes against the following claim from @wigglemuse:

But actually jibes with their emphatic later claim:

There’s a logical basis for this interpretation of the daily scores. If there are 20 days in a competition round, why would Numerai reward only the last day’s performance? It makes more sense to reward overall performance across the full 20 days, which suggests that the daily score on the last day – aka the final score – somehow includes contributions from the daily scores of all days. This, plus the observed convergence of the daily score to the final score, seems to point to the daily scores indeed being cumulative.

Assuming my interpretation is true, I wrote some code to extract per-day corr values from these cumulative daily scores. On day 1, the daily score would simply be day 1’s mean corr, since only one day of data is available. But day 2’s daily score = (day 1 corr + day 2 corr) / 2. Solve this equation for day 2’s corr, iterate across the remaining days, and you can extract the mean corr for each day in the round.
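The extraction step described above can be sketched in a few lines (my own sketch under the cumulative-mean hypothesis; the per-day numbers are made up for the round-trip check):

```python
import numpy as np

def extract_daily_corrs(cumulative_scores):
    # Invert a running mean back into per-day values, assuming the day-N
    # score is the mean of days 1..N:
    #   day_N = N * cum_N - (N - 1) * cum_{N-1}
    daily, prev = [], 0.0
    for n, cum in enumerate(cumulative_scores, start=1):
        daily.append(n * cum - (n - 1) * prev)
        prev = cum
    return daily

# Round-trip check on made-up per-day correlations:
true_daily = [0.05, -0.02, 0.01, 0.03]
running_means = list(np.cumsum(true_daily) / np.arange(1, 5))
recovered = extract_daily_corrs(running_means)  # recovers true_daily
```
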

Below is a plot of the daily scores for the 4 most recent completed rounds for the @benchmark_models model. I have not centered them on the final score as was done in @jrai’s plot.

Here is a plot of what I suspect are the daily corrs for this model, extracted using the approach outlined above:

A few things to note:

  • The daily corr values range roughly between -0.2 and 0.2 for this particular model.
  • If this model hasn’t changed across these 4 rounds (the assumption, given that it submits the example predictions), and if my interpretation and math are correct, then Numerai must use somewhat different data to evaluate each model on a given day in a given round, since the extracted daily corr values are not identical.
  • Diving deeper into that conclusion, there is likely some overlap in the data used to evaluate the models on a given date, since the rounds appear to make correlated moves – notice how often the direction of the daily change (positive or negative) is consistent across rounds.

If this interpretation of the daily score is accurate – i.e., that it is a cumulative mean over all finished days of a given round – users could use this daily-corr extraction process to better understand how their models perform in the wild. They could compute their own daily mean and Sharpe values from these live results, thereby getting better metric data than is currently available from the end-of-round numbers and the metrics computed on model upload.

Of course this might all be wrong so I’d love to hear what you all think.

Whether or not I’m right or wrong I have a suggestion for Numerai: you provide us with a dashboard of information to help us understand how our models are doing. Daily scores are a big part of that information. Given the ongoing confusion about what these daily scores actually are, please consider updating the documentation with an unambiguous definition! I was unable to find one.

Cheers,

PRC

I’m afraid not. They are not cumulative – at all. This has been asked directly of the team and answered unambiguously (since I posted about it above). I could even find it for you on video if you give me some time. The intermediate days have no bearing on your score. Another way to put it: if we had two alternate universes where the market snapshot was identical on the final day of scoring but the intermediate days were all completely different, would you get the same score? YES. Do they even need the intermediate market data to calculate your score? NO. Only the last day matters.

As far as the docs, I’m right now working on a comprehensive FAQ and other materials that answer all these common questions and confusions. I’m hoping to get the first part of it up today, but annoying things keep happening in my house the last couple of weeks to delay me. (Watch this section: Understanding Numerai – Numerai Tournament.)

Thanks for your input. Would be interested in seeing the video. Would be more interested in seeing the Numerai folks answer these basic questions in the docs instead of having generous users like you have to assemble an FAQ. If these Qs are FA, after all, then it’s probably a sign that the documentation is lacking.

Also: do you have any theories about why the daily score converges towards the final score if what you say is true about the intervening days not mattering?

As far as the docs go, they kind of enlisted me for that, so my making an FAQ is partly their action and partly mine (and it is hosted on their site). So they did recognize the need.

Why do the scores converge? Because each daily score is generated from a snapshot of a market day, and any given market day is most likely going to be pretty similar to the day before it unless there was a major market shock. So the score on the day prior to the final day is most likely going to be closer to the final-day score than the day before that, and so on. The farther back in time you go, the less likely you are to be sitting near the final-day score. That is all the graph is showing. You always end up at the final score, after all – just because it is the final score.
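That intuition can be illustrated with a toy simulation (my own sketch, not Numerai’s process): if each day’s score is a small random step away from the previous day’s, then the average absolute distance from the final-day score shrinks as the round progresses, exactly the downward-sloping shape in @jrai’s plot.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rounds, n_days = 500, 20

# Treat each round's daily scores as a random walk: every day's market
# snapshot equals the previous day's plus a small shock. (An assumption
# for illustration only; the shock size 0.005 is arbitrary.)
shocks = rng.normal(0.0, 0.005, size=(n_rounds, n_days))
scores = shocks.cumsum(axis=1)

# Mean absolute distance from the final-day score, by day of the round.
distance = np.abs(scores - scores[:, -1:]).mean(axis=0)
# distance[0] is the largest; distance[-1] is exactly 0, since the
# final-day score trivially equals itself.
```
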

This would indeed explain convergence towards the final score.

However, if the differences between consecutive daily scores are the result of typical daily market variation, and not the averaging I proposed, then we would expect to see statistically similar variation between any two consecutive daily tournament scores.

In other words, if your explanation is true, then the market delta between days 19 and 20 of a round is not “special” – it should not be statistically different from the delta between days 1 and 2, or any two consecutive days for that matter, barring the major market shocks you reference (and perhaps other issues like weekend breaks and so on).

Luckily, we have plenty of data with which to investigate this. I took all 82 completed rounds of @benchmark_models and computed the day-over-day change in the daily score. Here is a plot of the standard deviation of that change per tournament day, over all 82 rounds.


This appears to counter your claim that the change between the last two daily corr values is due solely to daily market movement – unless, magically, the market moves less on the final days of a Numerai round than it does on the starting days. Which of course it doesn’t.

Below is a plot where I generated 82 rounds of Gaussian-distributed fake “daily corr” values, computed the day-to-day differences, and plotted the std of those differences. It comes out much more like what I’d expect from the daily scores if what you wrote were true – namely, that the delta between consecutive daily scores has nothing to do with the tournament day and is more a measure of typical daily market movement.


So, to re-make my case, since your explanation for the convergence isn’t supported by the data, here is a plot where I took 82 rounds of Gaussian-distributed fake “daily corr” values and computed the running average over each day – i.e., I’m simulating what I proposed in my earlier post about what the daily scores might really represent. You’ll see an std curve that decreases over the course of the round, as we saw in the first plot, and as one would expect if the convergence is due to temporal averaging rather than daily market movement.

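The two hypotheses can be compared side by side in a few lines (my own simulation sketch with an arbitrary σ and Gaussian assumption, not Numerai data): independent daily snapshots produce a roughly flat std-of-delta curve, while running means produce the shrinking curve we observe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rounds, n_days, sigma = 82, 20, 0.03

daily = rng.normal(0.0, sigma, size=(n_rounds, n_days))

# Hypothesis A: the reported scores ARE independent daily snapshots.
# Consecutive deltas then have a roughly constant std (~ sigma * sqrt(2)).
std_indep = np.diff(daily, axis=1).std(axis=0)

# Hypothesis B: the reported scores are running means of the daily values.
# Consecutive deltas then shrink roughly like sigma / n over the round.
running_mean = daily.cumsum(axis=1) / np.arange(1, n_days + 1)
std_cum = np.diff(running_mean, axis=1).std(axis=0)
```
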

In the spirit of getting to the bottom of this, there is at least one explanation that would reconcile the convergence over time towards the final score, the ever-shrinking std of the daily-score deltas, and your belief that the score on the final day is truly independent of the scores on the previous days.

That explanation is this: each model is evaluated using 20 different days’ worth of data. On day 1 of a round, only some of day 1’s live data is used to compute the corr; the rest is embargoed for later use. On day 2, only some of day 2’s live data is used (and the average over days 1 and 2 is presented as the daily score). This continues until the final day, when all the previously embargoed live data is folded into one big final score calculation. The idea is that the model performs similarly enough on the separate chunks of each day’s data that the daily scores are predictive/indicative of the final score, while the final score remains truly independent of the prior scores.

Obviously enough data could be embargoed that each day’s score could be computed the same way (aka, day 2’s cumulative score includes embargoed data from day 1 and some from day 2) thus keeping all daily scores truly independent of each other.

Maybe this is it, maybe not. Regardless, I hope you see why I’m not ready to accept your explanation for the convergence. It’s simply not supported by the evidence. So I’ll re-state my desire to have Numerai officials give us the real answer, or tell us why they won’t and thus keep this investigation going!

Since I originally asked this question, it has come up a number of times in Rocket.Chat without correction by the team, and @wigglemuse satisfied me earlier that I was misinterpreting what was going on.

Specifically in regards to what you said here,

I am confident that this is incorrect. It might be correct if the hedge fund were only looking at stock prices; however, it does not work if any derivatives with expiration dates are involved. For options specifically, the premium baked into the price varies with the time remaining before expiration, and tends to shrink as the options approach their expiration date. I think that by itself would lead to smaller price movements and lower daily variation, producing the decreasing trend you see.

Thanks for raising this possibility. Varying expiration dates of certain financial instruments could certainly explain a wider variance in the daily market deltas than I included in my simple simulation.

I’m not ready to accept “it does not work if any derivatives with expiration dates are involved”, because if only a small percentage of the fund’s holdings are in expiring derivatives, they wouldn’t move the needle very much.

But let’s assume for a moment that the fund is built exclusively from these kinds of expiring instruments. Since we observe steady convergence in every round even though rounds X and X+1 overlap for 15 days, I believe this would also imply that each round must have its own independent holdings that expire during that round.

Yours is certainly a viable theory when combined with assumptions like these about the nature of the fund’s holdings. Perhaps those assumptions are why the Numerai team isn’t chiming in to answer this question about the daily scores, but I’ll still continue the plea. Courtesy of The Humans of Numerai thread, I’m calling out some Numerai staff explicitly here, wondering if @slyfox, @master_key, @mdo, or @son_sioux could offer more insight and/or help us rectify some of the contradictory theories that have appeared in this thread.

What remains clear is that the early daily scores are not very predictive of the final score – just look at @jrai’s original plot. So why are they presented to us at all? What value are we supposed to get from them if they have no bearing on the final score, as some have claimed, and (as yet) no unambiguous meaning? I would hope the staff could offer a “safe” answer to that question that does not expose information about the fund’s internal operation. I mean, if you’re going to put a speedometer on the car, please let us drivers know how to read it. Or maybe add some error bars based on the tournament day, so we know how much faith to put in those scores.

Thanks, all, for your continued input!

Because under the previous setup we had zero feedback for 4 weeks. This is better than that.

Given that we are aiming to predict the state of play in 4 weeks’ time, and not for every day leading up to that point, is there any reason we would expect early scores to be indicative of the outcome? The day most likely to look like the final one in the markets, relative to the starting point, is the one just before it, and the further from the goal you go, the more discrepancy there will inevitably be. Or am I missing something with this reasoning (which I accept is entirely possible!)?