Prediction Definition Question


Hi there, brand new to Numerai and have some basic questions, the first being:

Given the test data (id, era, f1…f21, tgt), what exactly is the prediction format (id, probability) predicting?

The docs say:

The probability column is the probability estimated by your model of the observation being of class 1.

Is it: given another feature (f22, f23, or f67), what is the probability of it being of class 1? Or is it the probability of it being of class 1 in the next era? Or something else?



Have you downloaded the dataset and looked at the sample models and example predictions file? That should clear some stuff up I think.

In brief, each datapoint has 21 numbers associated with it (the features) as well as an era, id, data_type, and target. The point is to use the features to create a model on the training dataset and use that model to predict the targets of the tournament dataset. Your output file should contain just two columns, id and the probability of the datapoint being target=1 based on your model.
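A minimal sketch of that workflow, using random stand-in data since the real CSVs aren't reproduced here; the column names (`feature1`…`feature21`, `target`, `id`) are assumptions and may not match the actual download's headers:

```python
# Sketch: fit on the training features, output (id, probability) for the
# tournament rows. Data here is random stand-in, not the real dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
features = [f"feature{i}" for i in range(1, 22)]  # assumed column names

# Stand-ins for the training and tournament CSVs
train = pd.DataFrame(rng.random((n, 21)), columns=features)
train["target"] = rng.integers(0, 2, n)
tournament = pd.DataFrame(rng.random((n, 21)), columns=features)
tournament["id"] = [f"id{i}" for i in range(n)]

model = LogisticRegression()
model.fit(train[features], train["target"])

# predict_proba returns [P(class 0), P(class 1)]; keep the class-1 column
submission = pd.DataFrame({
    "id": tournament["id"],
    "probability": model.predict_proba(tournament[features])[:, 1],
})
# submission.to_csv("predictions.csv", index=False)
```

The submission file is just those two columns: one row per tournament id, with your model's estimated probability that the row's target is 1.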


Thanks, I just realized there is validation, test, and live data in the tournament CSV, but I am still trying to grasp intuitively what the probability is trying to predict, so let me try asking differently. If I upload this prediction (a pure-chance probability for each id)

disregarding consistency (0%) and originality (X) for now (logloss is 0.69314 and concordance checks :slight_smile:), would this say that my model predicts a 50% chance of class 1 for any observation in the tournament test data, having trained my algorithm against the training set?
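As a sanity check, a constant 0.5 prediction on every row yields a log loss of exactly -ln(0.5) = ln(2) ≈ 0.69315, which matches the 0.69314 quoted above: the submission carries no information about the target. A small self-contained check:

```python
# Log loss of a pure-chance (constant 0.5) prediction is ln(2),
# regardless of what the true labels are.
import math

def log_loss(y_true, y_pred):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 1, 0, 1]            # any labels work
y_pred = [0.5] * len(y_true)        # chance probability for each id
print(round(log_loss(y_true, y_pred), 5))  # → 0.69315
```

Any model that beats 0.69315 on out-of-sample data is doing better than chance.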


Yes, perfect randomness.
The task from here is to develop and engineer a model so you have predictive ability better than random. The training data has a signal, but it is very weak - a lot of ‘magic’ is required to engineer a good model.
Good luck.


Thanks, I got as far as having Originality, Concordance, Consistency (16.66%), and a logloss of 0.69385, so I think I am on the right path. I am still not sure I am estimating the probability correctly, and of course my model is very simple right now, so I need to work on that.

Thanks again.


If you have specific model/code questions feel free to ask, either here or on the slack. A consistency of 16.66% would suggest that you’re either not using a particularly good model or that you might be doing something wrong. A straightforward logistic regression with no other parameters on the 21 features of the training data should yield a consistency of 58.33% and a validation log loss of 0.6926136.
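A sketch of how consistency might be computed, assuming it is the percentage of validation eras whose log loss beats random (ln 2). The 16.66% and 58.33% figures above are consistent with 12 validation eras (2/12 and 7/12), but both the era count and this definition are inferences, and the data below is synthetic:

```python
# Consistency sketch: fraction of eras with log loss below ln(2).
# Era labels and predictions here are made up for illustration.
import math
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
eras = np.repeat([f"era{i}" for i in range(1, 13)], 50)  # 12 assumed eras
y_true = rng.integers(0, 2, eras.size)
# Predictions with a faint signal plus noise, clipped to a valid range
y_pred = np.clip(0.5 + 0.02 * (y_true - 0.5)
                 + rng.normal(0, 0.01, eras.size), 0.01, 0.99)

beaten = sum(
    log_loss(y_true[eras == e], y_pred[eras == e], labels=[0, 1]) < math.log(2)
    for e in np.unique(eras)
)
consistency = 100.0 * beaten / np.unique(eras).size
print(f"consistency: {consistency:.2f}%")
```

Scoring per era rather than over the pooled validation set rewards models that beat chance reliably, not just on average.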


Thanks, I’ll shoot for that next.


Should I be looking to create the model myself in Python code? Or find an already existing one, similar to the “linear_model.LogisticRegression(n_jobs=-1)” that was used in the example? Or is there some way to modify/tweak the existing ones so that they generate the model in the correct format and I’m just changing the logic within?

And how do I tell if I’m doing the right thing and on the right path here? Should I be uploading every predictions file to the website? Or is there a way to check a predictions file locally, after it is generated but before uploading?


You can use existing types of models, but if you don’t tweak things you probably won’t pass the originality check. You should probably look into some online data science or machine learning courses to learn the basics of model building and cross-validation, so you can test your models locally and pick the best one. You can check ahead of time what your validation log loss and consistency values will be, because those are straightforward checks on the data we have, but you can’t guarantee concordance and originality without uploading and seeing if your model passes.
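A minimal local check along those lines, using scikit-learn's cross-validation on synthetic stand-in data (era-aware splitting, e.g. grouping folds by the era column, would likely be closer to how the tournament evaluates, but plain k-fold shows the idea):

```python
# Local model check: 5-fold cross-validated log loss.
# X and y are random stand-ins for the real training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((600, 21))
y = rng.integers(0, 2, 600)

# neg_log_loss is negated so that higher is better; flip the sign back
scores = cross_val_score(LogisticRegression(), X, y,
                         scoring="neg_log_loss", cv=5)
print("mean validation log loss:", -scores.mean())
```

On purely random data like this the score hovers around ln(2) ≈ 0.693; a real model on the real data should come in below that.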


Will do, thanks. That makes sense.


Thanks, I now have a little pipeline for testing models correctly, based on the Python example and the downloadable test data.

My original mistake was that I tried (rather unsuccessfully) to come up with the probabilities in my Excel test bed; now I have something to compare and reverse engineer.

Results after tweaking the LogisticRegression a bit:

I am failing badly at Originality; not sure if it merits opening a new thread (I know the higher-ups mentioned they would open-source it at some point). Is it just a matter of changing models (I am itching to try an SVM next, for instance), or is there more to it?

Thanks again


Well, depends on how much “tweaking” you’re doing, but just using a basic logistic regression is unlikely to reach originality without either getting it in extremely early in the round or doing some feature engineering to change what features are going into the model.

Logistic regression is a particularly low bar (there’s a reason it’s in the example file), and a lot of beginners are likely submitting slight variations of it, which makes originality tough to achieve with it.
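One illustrative way to move beyond the stock example is to transform the features before the logistic regression, e.g. adding pairwise interaction terms, so the model is no longer a plain fit on the raw 21 columns. This is just a sketch of the feature-engineering idea mentioned above, not a recipe for passing originality, and the data is random stand-in:

```python
# Feature-engineering sketch: pairwise interaction terms feeding a
# logistic regression, instead of the raw features alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.random((300, 21))          # stand-in for the 21 features
y = rng.integers(0, 2, 300)        # stand-in for the target

model = make_pipeline(
    # 21 raw features + 210 pairwise products = 231 inputs to the model
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
```

Whether any particular tweak clears the originality bar can only be confirmed by uploading, as noted earlier in the thread.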