Alternative Modelling Algorithms & Approaches

I would like to share some of my explorations in the seemingly infinite field of Machine Learning algorithms that could be applied to the Numerai tournament. I shared this answer on Stack Exchange so now I’m sharing it with you guys. I encourage others to share and discuss in this thread interesting approaches to modeling the Numerai dataset.

My intention was to avoid the regular regression loss objective, even though it already works well on the Numerai dataset. I was interested in an approach that treats these values between 0 and 1 as probabilities that an observation belongs to a class.

Beta Regression
In short, it represents y as a distribution of probabilities of a target belonging to a class (or any other event). The link function for this regression restricts ŷ ∈ (0, 1). Interestingly, it doesn’t work if y = 0 or y = 1, two values present in the Numerai dataset.

Because the values 0 and 1 are in the Numerai dataset, the beta distribution is not the best representation of the Numerai target. However, by replacing the zeroes with 0.01 and the ones with 0.99 (or other similar values), the algorithm will be able to learn parameters for its model, in a hackish kind of way.
Keep in mind that the closer the replacement value is to 0 or 1, the more strongly those data points pull the fit towards the extremes.
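
Here is a minimal sketch of the clipping trick, assuming a logit link for the mean and a single shared precision parameter. The data is synthetic and the helper names (squeeze_targets, beta_nll, fit_beta_regression, predict_mean) are made up for illustration; this is not taken from any Numerai example code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln


def squeeze_targets(y, eps=0.01):
    """Move exact 0s and 1s inside the open interval (0, 1)."""
    return np.clip(y, eps, 1.0 - eps)


def beta_nll(params, X, y):
    """Negative log-likelihood of y ~ Beta(mu * phi, (1 - mu) * phi)."""
    w, b, log_phi = params[:-2], params[-2], params[-1]
    mu = np.clip(expit(X @ w + b), 1e-6, 1.0 - 1e-6)  # mean via logit link
    phi = np.exp(np.clip(log_phi, -10.0, 10.0))       # precision, kept positive
    a, c = mu * phi, (1.0 - mu) * phi
    ll = (gammaln(phi) - gammaln(a) - gammaln(c)
          + (a - 1.0) * np.log(y) + (c - 1.0) * np.log1p(-y))
    return -ll.sum()


def fit_beta_regression(X, y, eps=0.01):
    """Fit weights, intercept and log-precision by maximum likelihood."""
    y_sq = squeeze_targets(y, eps)
    x0 = np.zeros(X.shape[1] + 2)
    return minimize(beta_nll, x0, args=(X, y_sq), method="L-BFGS-B").x


def predict_mean(params, X):
    """Predicted mean of the beta distribution, always inside (0, 1)."""
    return expit(X @ params[:-2] + params[-2])


# Synthetic stand-in for the tournament data: five target buckets 0, 0.25, ..., 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
latent = expit(X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=0.5, size=500))
y = np.round(latent * 4) / 4

params = fit_beta_regression(X, y)
preds = predict_mean(params, X)
```

Changing eps is the knob discussed above: the smaller it is, the more weight the zeroes and ones carry in the likelihood.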

Regression Models For Ordinal Data
Ordinal data regression is more straightforward. It’s still a type of regression analysis, but this one is popular in the social sciences, where the options can be ordered, such as a rating from 1 to 5, with 1 being very poor and 5 being excellent. Modeling such a problem is complex and requires learning the thresholds between classes as well as modeling the target.
The final predictions are probabilities of belonging to each (ordered) class or, if you take the probability-weighted mean of the class values, a continuous target similar to a regression prediction.
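
To make the threshold idea concrete, here is a rough sketch of the prediction side of a cumulative-logit (proportional odds) model. The thresholds and scores are made-up numbers rather than fitted values, and the function names are only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ordinal_class_probs(score, thresholds):
    """P(y = k) for each ordered class, given a latent score per row."""
    score = np.asarray(score, dtype=float)[:, None]
    # P(y <= k) for every threshold, padded with 0 on the left and 1 on the right
    cum = sigmoid(np.asarray(thresholds, dtype=float)[None, :] - score)
    cum = np.hstack([np.zeros_like(score), cum, np.ones_like(score)])
    return np.diff(cum, axis=1)


def expected_target(probs, class_values):
    """Probability-weighted mean of the class values (a continuous prediction)."""
    return probs @ np.asarray(class_values, dtype=float)


thresholds = np.array([-2.0, -0.7, 0.7, 2.0])        # hypothetical, 4 cuts -> 5 classes
class_values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
scores = np.array([-3.0, 0.0, 3.0])                   # hypothetical latent scores

probs = ordinal_class_probs(scores, thresholds)
preds = expected_target(probs, class_values)
```

In a real model both the thresholds and the function producing the latent score would be learned together.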

Regression with a Logistic Link Function
It’s easier to use a library than to implement the loss function by hand. XGBoost supports it if you specify reg:logistic as the objective function. Other libraries like Keras can support similar behavior as well.
This is clearly the easiest one to implement and interpret results.
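
For anyone curious what this objective looks like under the hood, here is a small sketch of the logistic (cross-entropy) loss with fractional targets, written by hand in NumPy. The gradient p - y and Hessian p(1 - p) with respect to the raw score are the standard ones for this loss. The linear model, the Newton loop and the synthetic data are only assumptions for the sake of a runnable example.

```python
import numpy as np


def logistic_grad_hess(raw_score, y):
    """Gradient and Hessian of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)


def fit_linear_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton's method for a linear model with targets anywhere in [0, 1]."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad, hess = logistic_grad_hess(Xb @ w, y)
        H = Xb.T @ (Xb * hess[:, None]) + ridge * np.eye(Xb.shape[1])
        w -= np.linalg.solve(H, Xb.T @ grad)
    return w


def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-(Xb @ w)))


# Synthetic stand-in data with five target buckets, as in the tournament.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.5, size=500)
latent = 1.0 / (1.0 + np.exp(-(X @ np.array([0.8, -0.5, 0.0]) + noise)))
y = np.round(latent * 4) / 4

w = fit_linear_logistic(X, y)
preds = predict(w, X)
```

Predictions stay inside (0, 1) by construction, and the targets 0 and 1 cause no trouble here, unlike with beta regression.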


Beta regression sounds like an interesting variant to try. I’m trying to wrap my head around it, especially combined with this part of the “analysis_and_tips” Python notebook. If I get this right, it says that models trained on either of the two extreme classes versus the rest generate more or less the same result, while models trained on any of the middle classes versus the rest generate results that are alike, but negatively correlated with those of the models trained on the extreme classes.

The first and last class are highly correlated

plt.imshow(corrs, vmin=-1, vmax=1, cmap="RdYlGn")


Maybe beta regression, rather than logistic regression, changes that picture. Interesting thought. Alternatively, maybe there is some way to break up the tournament into 5 class-vs-other-classes problems and put the results back together. The whole “extremes are correlated” thing reminds me of my badtimes and goodtimes models, which are also almost p/1-p :-). Have to think more.


@wacax, I think there is a nice use for the beta distribution, one that keeps staring us in the face. This may be parallel or unrelated to what you are doing. Either way, do you have a routine for estimating its parameters and the errors on those parameters? There is a NIST article outlining the maximum likelihood method of obtaining the parameters, with initial values taken from estimates of the moments of the distribution. It looks a bit hairy, but I wonder how different the maximum likelihood estimates would be from the initial estimates, and whether simple propagation of errors on the statistics would give values that one could be confident in.

No, I don’t. Hopefully, if you figure it out, you can share with us more about this mysterious use you are talking about.

Yup, I figured out how to do it. The mysterious use is to fit the beta distribution to the out-of-sample scores during cross-validation. Why the beta distribution? Because it naturally lives on an interval and comes with skew and excess kurtosis, unlike the Normal distribution, which does not live on an interval and has no skew or excess kurtosis. We can map the beta distribution to the interval of correlations, [-1,1]. Once you have used maximum likelihood to fit the distribution of scores, you can derive any kind of estimator you like from it. In particular, maximum likelihood estimates are robust against the spurious fluctuations that are statistically guaranteed to occur during parameter optimization. See how nicely it fits our data: (figure not reproduced here)
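
For readers who want to try this, here is a rough sketch of the fitting step, assuming the scores are correlations in [-1, 1]. The scores below are simulated placeholders, not real tournament results, and the helper names are made up. The moment formulas are the usual method-of-moments estimates for the beta distribution, and the maximum likelihood step uses scipy.stats.beta.fit with the location and scale held fixed.

```python
import numpy as np
from scipy import stats


def to_unit_interval(scores, lo=-1.0, hi=1.0):
    """Map scores from [lo, hi] onto [0, 1], where the beta distribution lives."""
    return (np.asarray(scores, dtype=float) - lo) / (hi - lo)


def beta_moment_estimates(x):
    """Method-of-moments estimates of the two shape parameters."""
    m, v = x.mean(), x.var(ddof=1)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


# Hypothetical cross-validation scores, simulated for illustration only.
rng = np.random.default_rng(0)
scores = 2.0 * rng.beta(8.0, 6.0, size=2000) - 1.0

x = to_unit_interval(scores)
a_mom, b_mom = beta_moment_estimates(x)

# Maximum likelihood fit with loc and scale fixed, so only the shapes are estimated.
a_mle, b_mle, loc, scale = stats.beta.fit(x, floc=0, fscale=1)

# Mean of the fitted distribution, mapped back to the correlation scale.
mean_score = 2.0 * a_mle / (a_mle + b_mle) - 1.0
```

Comparing (a_mom, b_mom) with (a_mle, b_mle) is a quick way to see how far the maximum likelihood estimates move from the moment-based starting values.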

I also considered the Logit-Normal, but in the limit where the standard deviation goes to zero, its skew and excess kurtosis also go to zero, which is contrary to what is observed. I am exploring the Beta-Ratio: the ratio of the areas of the fitted beta distribution above and below some threshold. It is in some sense similar to the Sortino ratio, but it remains finite no matter where you set the threshold.
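
One possible reading of the Beta-Ratio, sketched under the assumption that it is simply the area above the threshold divided by the area below it, with the fitted distribution mapped onto [-1, 1]. The function name and the shape parameters are hypothetical.

```python
from scipy import stats


def beta_ratio(a, b, threshold, lo=-1.0, hi=1.0):
    """Area of the fitted beta distribution above a threshold over the area below."""
    t = (threshold - lo) / (hi - lo)   # threshold on the unit interval
    above = stats.beta.sf(t, a, b)     # P(score > threshold)
    below = stats.beta.cdf(t, a, b)    # P(score <= threshold)
    return above / below


# Hypothetical shape parameters, e.g. from a maximum likelihood fit.
a, b = 8.0, 6.0
ratio_at_zero = beta_ratio(a, b, threshold=0.0)
ratio_at_point_two = beta_ratio(a, b, threshold=0.2)
```

With these parameters most of the mass sits above zero, so the ratio at a threshold of 0 is greater than 1, and it shrinks as the threshold is raised.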