Are predictions discrete or continuous?

Hello,

Noob here. Inspecting the file example_predictions_target_kazutsugi.csv, I see a column with decimal numbers. For example:

n0003aa52cab36c2 0.48416
n000920ed083903f 0.47641
n0038e640522c4a6 0.53401

However, the training and validation data has discrete values for the target, i.e. target_value ∈ {0, 0.25, 0.5, 0.75, 1}. Also, the number of rows is very different: whereas there are only about 5000 data_type=“live” rows in the tournament_data, there are about 1.6 million rows in the example file.

Clearly I am confused about the meaning of either the target or the example_predictions_target. Could somebody clarify what the format of the predictions should be?

The example predictions file you are looking at is a valid submission file. You could upload it right now and it would be accepted. So that’s what it is supposed to be like – it includes ALL of the rows in the “tournament” data file (i.e. submit predictions for that entire file.) But the only new data each week is the “live” data, so if your model doesn’t change from week to week you can work out your system so you don’t need to predict the entire thing every time. Just depends how fast your model runs and the resources it needs if you need to fuss about that. (Last week’s live data is added to the end of the tournament file each week as a new test era.)

As far as your predictions, they should be in the range of 0-1 just like in the example file. But yes, the training data only uses 5 discrete values/buckets for the targets. Nevertheless, your predictions should be real valued and ideally not contain any ties (they will be broken by row order, i.e. essentially randomly).
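
To make that concrete, here is a tiny sketch (my own illustration with made-up column names, not anything official) for checking how many tied predictions an era contains:

import pandas as pd

# Hypothetical submission for a single era; the column names are just for illustration.
preds = pd.DataFrame({
    "era": ["era1"] * 5,
    "prediction": [0.25, 0.25, 0.5, 0.75, 1.0],
})

# Count duplicated (i.e. tied) prediction values within each era.
ties_per_era = preds.groupby("era")["prediction"].apply(lambda s: s.duplicated().sum())
print(ties_per_era)  # era1 has 1 tie in this toy example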

Oh, and in about a week we are getting new validation eras and moving to a target with a different distribution (but one that still has only discrete training values), so what I’ve just said about the file being relatively stable won’t be true on the round that starts Nov 14. On that round, you’ll see some test eras disappear and some new validation eras be added. Keep an eye on this forum and the rocketchat for the latest changes.

Thank you for the detailed reply @wigglemuse. So just to clarify, what is the intuitive meaning of each target value? What would the following row mean

n0003aa52cab36c2 0.48416

with respect to 0, 0.25, 0.5, 0.75, 1?

your predictions should […] not contain any ties

What are ties? Or what would be an example of a prediction containing ties?

You are scored on ranking correlation PER ERA. Each era has about 5000 rows. When I say you should not have ties, I mean for that 5000-row era you should have 5000 different values – not 1000 rows of 0.0, another 1000 rows of 0.25, etc. – each of your predictions should be unique. (So to answer your original question, predictions are continuous and not discrete even though the training targets are discrete.) And again, you are scored on rank, so the values don’t really have intuitive meanings, only the order of them. If there were only 5 rows in the era, then 0.1, 0.2, 0.3, 0.4, 0.5 would get the exact same score as 0.6, 0.62, 0.8, 0.85, 0.99 because they are in the same ranking order (and with no ties).
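
If you want to convince yourself of that last point, here is a little sketch (my own, assuming a toy 5-row era and the rank-then-correlate scoring described in this thread) showing that only the ordering matters:

import numpy as np
import pandas as pd

target = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # toy 5-row era, already in order
pred_a = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
pred_b = np.array([0.6, 0.62, 0.8, 0.85, 0.99])   # different values, same ordering

def score(y_true, y_pred):
    # percentile-rank the predictions (ties broken by row order), then Pearson correlation
    ranked = pd.Series(y_pred).rank(pct=True, method="first")
    return np.corrcoef(y_true, ranked)[0, 1]

print(score(target, pred_a), score(target, pred_b))  # identical scores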

I’m afraid it’s still unclear to me where the advantage of unique (or strictly ordered) predictions lies. Perhaps I don’t quite understand what we are trying to predict.

To try to flesh out my question, assuming the numerai_score is indeed calculated as:

rank_pred = y_pred.groupby(eras).apply(lambda x: x.rank(pct=True, method="first"))
numpy.corrcoef(y_true, rank_pred)[0,1]

We are trying to find the Pearson product-moment correlation coefficient between the target and our ranked prediction. To look at a concrete example: in the numerai example data, 400 rows of the target in the first era look like this:

[image of the discrete target values omitted]

and the prediction of a linear regression looks like this:

[image of the continuous predictions omitted]

which after ranking becomes:

[image of the percentile-ranked predictions omitted]

We then calculate the correlation.
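
For reference, here is a self-contained version of that scoring recipe (my reading of the two lines quoted above, with imports and a made-up toy era filled in; not official code):

import numpy as np
import pandas as pd

# Made-up stand-in for one era of data.
eras = pd.Series(["era1"] * 6)
y_true = pd.Series([0.0, 0.25, 0.25, 0.5, 0.75, 1.0])     # discrete targets
y_pred = pd.Series([0.41, 0.47, 0.44, 0.52, 0.55, 0.61])  # continuous predictions

# Percentile-rank the predictions within each era (method="first" breaks ties by row order),
# then take the Pearson correlation against the raw, un-ranked targets.
rank_pred = y_pred.groupby(eras).apply(lambda x: x.rank(pct=True, method="first"))
print(np.corrcoef(y_true, rank_pred)[0, 1])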

Why wouldn’t a perfect discrete prediction that matches exactly the target work best? Let’s take a toy example where my model is god-like and predicts the output perfectly:

t = numpy.array([0,0.25,0.5,0.25,0.25,0.75,1]) # target
pre = numpy.array([0,0.25,0.5,0.25,0.25,0.75,1]) # prediction

If I rank and calculate correlations with:

pre_rank = pandas.DataFrame(pre).apply(lambda x: x.rank(pct=True, method="first"))
numpy.corrcoef(t, pre_rank.T)[0,1]

I get a numerai_score of 0.95.

If I instead construct a prediction which is directionally correct (it goes up and down when the target does) but has repeated values (i.e. ties), I still get the same score of 0.95.

If I remove the ties in the prediction, nothing changes in the score. For example, using

pre = numpy.array([0,0.2, 0.4, 0.3, 0.35, 0.6, 0.8])


Equally, introducing errors in the 3 cases above (perfect prediction, monotonically correct prediction with ties, monotonically correct prediction without ties) seems to give equal scores.
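
To make that exploration easy to rerun, here is the whole thing in one sketch (my own code; the tied-but-directionally-correct prediction is a construction of mine, since the original screenshots aren’t reproduced here):

import numpy as np
import pandas as pd

t = np.array([0, 0.25, 0.5, 0.25, 0.25, 0.75, 1])  # target

def numerai_score(y_true, y_pred):
    # rank the predictions (ties broken by row order), then Pearson correlation with the raw target
    ranked = pd.Series(y_pred).rank(pct=True, method="first")
    return np.corrcoef(y_true, ranked)[0, 1]

cases = {
    "perfect, discrete":            np.array([0, 0.25, 0.5, 0.25, 0.25, 0.75, 1]),
    "directionally right, ties":    np.array([0.1, 0.3, 0.5, 0.3, 0.3, 0.7, 0.9]),
    "directionally right, no ties": np.array([0, 0.2, 0.4, 0.3, 0.35, 0.6, 0.8]),
}
for name, pre in cases.items():
    print(name, round(numerai_score(t, pre), 4))  # all three give the same ~0.95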

With that little exploration, I’m back to my original question: why can’t the submission look like the target – in other words, a column with just the 5 possible values between 0 and 1? It seems the numerai_score in the documentation would allow it.

I would appreciate any insights you may have to help me see what I am missing. Thank you.


I’m not making any theoretical statements about where the advantage may be, or about whether they are using the “correct” scoring model. I’m telling you how the scores are generated. They break the ties in your predictions (that’s what the “first” does in the line of code you posted – it breaks the ties by row order). But they don’t break the ties in the targets – those are left just as they are, with only 5 discrete values. And then they take the Pearson correlation of those two vectors to get your correlation score. So you CAN have only 5 discrete values in your predictions if you want; it just isn’t smart, because the ties in your predictions will be broken essentially randomly (by row order, but row order has no meaning in this data).

So if your predictions are {.25, .25, .25, .5, .5, .5, .75, .75, .75} then they are converted to {1,2,3,4,5,6,7,8,9} for scoring (i.e. no ties). So now the .25 in the 3rd position is considered to be after the .25 in the 1st position just because that’s the order of the rows. But any decent model will be able to make much finer and better distinctions between rows, so submitting predictions like {.3, .2, .25, .46, .52, .51, .68, .81, .74} is much better because now you’ve controlled the ranks they are going to end up with {3,1,2,4,6,5,7,9,8} instead of leaving it up to chance which can lead to wildly different scores just depending on the row order (if you have a large number of ties – a few ties isn’t so bad). [In your example I think you broke the ties but not in such a way that the order changed.] And the row order is fixed, you can’t submit rows in a different order. So let your model be the best that it can be under this system without a random component to your score and don’t have ties in your predictions…
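
A quick check of that rank conversion (just pandas, nothing official):

import pandas as pd

tied    = pd.Series([.25, .25, .25, .5, .5, .5, .75, .75, .75])
no_ties = pd.Series([.3, .2, .25, .46, .52, .51, .68, .81, .74])

# method="first" breaks ties purely by row order
print(tied.rank(method="first").astype(int).tolist())     # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(no_ties.rank(method="first").astype(int).tolist())  # [3, 1, 2, 4, 6, 5, 7, 9, 8]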


Thank you @wigglemuse, that was a clear explanation which helped me understand what you meant.

I tried to test the idea with your series repeated twice. But:
{.3, .2, .25, .46, .52, .51, .68, .81, .74, .3, .2, .25, .46, .52, .51, .68, .81, .74} -> ranked as: {0.278, 0.056, 0.167, 0.389, 0.611, 0.5, 0.722, 0.944, 0.833, 0.333, 0.111, 0.222, 0.444, 0.667, 0.556, 0.778, 1.0, 0.889}

gives the same score as:
{.25, .25, .25, .5, .5, .5, .75, .75, .75, .25, .25, .25, .5, .5, .5, .75, .75, .75} -> ranked as {0.056, 0.111, 0.167, 0.389, 0.444, 0.5, 0.722, 0.778, 0.833, 0.222, 0.278, 0.333, 0.556, 0.611, 0.667, 0.889, 0.944, 1.0}

Graphically, it would be the difference between the two ranked series plotted side by side (images omitted).

But perhaps the differences will show up with a larger dataset, or when prediction and target are much less correlated? I will explore further and post here if I gain any further insights.


Yes, with tiny samples there are lots of ways to get the same correlation score. In the real tournament the typical era has around 5000 rows and you are lucky to get 5% positive correlation.
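
If you want to see that at era scale, one way (a simulation sketch of my own, with entirely made-up data) is to generate a ~5000-row era with a weak signal and compare how much the score moves under different row orders for heavily tied predictions versus tie-free ones:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# made-up era: discrete targets plus a weakly informative continuous signal
target = rng.choice([0, 0.25, 0.5, 0.75, 1.0], size=n)
signal = target + rng.normal(scale=2.0, size=n)

def score(y_true, y_pred):
    # rank the predictions (ties broken by row order), then Pearson correlation with the raw target
    ranked = pd.Series(y_pred).rank(pct=True, method="first")
    return np.corrcoef(y_true, ranked)[0, 1]

tied_pred = np.round(signal * 2) / 2   # bucketed into 0.5-wide steps -> lots of ties
untied_pred = signal                   # same information, no ties

def scores_under_shuffles(pred, k=20):
    # score the same predictions after shuffling the row order k times;
    # row order only matters where ties have to be broken
    out = []
    for _ in range(k):
        order = rng.permutation(n)
        out.append(score(target[order], pred[order]))
    return np.array(out)

print("score spread with ties   :", scores_under_shuffles(tied_pred).std())
print("score spread without ties:", scores_under_shuffles(untied_pred).std())  # essentially zero

The tie-free predictions give (up to floating point) the same score under every row order, while the heavily bucketed version moves around depending on how its ties happen to be broken.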