Alternative Modelling Algorithms & Approaches

I would like to share some of my explorations in the seemingly infinite field of machine learning algorithms that can be applied to the Numerai tournament. I originally shared this answer on Stack Exchange, so now I’m sharing it with you all here. I encourage others to use this thread to share and discuss interesting approaches to modeling the Numerai dataset.

My intention was not to use a regular regression loss objective to model the Numerai dataset, which already works well. Instead, I was interested in approaches that treat these values between 0 and 1 as probabilities that an observation belongs to a class.

Beta Regression
In short, it represents y as beta-distributed, i.e., as a distribution over the probability of the target belonging to a class (or of any other event occurring). The link function for this regression restricts ŷ ∈ (0, 1). Interestingly, it doesn’t work if y = 0 or y = 1, two values present in the Numerai dataset.


Because the values 0 and 1 are present in the Numerai dataset, the beta distribution is not the best representation of the Numerai target. However, by moving the zeros to 0.01 and the ones to 0.99 (or other similar values), the algorithm will be able to learn its model parameters, in a somewhat hackish way; see the sketch below.
Keep in mind that the closer the replacement value is to either 0 or 1, the more those data points will be biased toward the extremes.
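To make this concrete, here is a minimal, self-contained sketch of beta regression fitted by maximum likelihood with a logit link, written with scipy on synthetic data (all variable names are illustrative, not actual Numerai columns). It includes the replacement hack described above.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # stand-in for Numerai features
y = expit(X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000))
y = np.clip(y, 0.01, 0.99)                                  # the replacement hack described above

def neg_log_likelihood(params):
    coefs, log_phi = params[:-1], params[-1]
    mu = expit(X @ coefs)                                   # logit link keeps the mean in (0, 1)
    phi = np.exp(log_phi)                                   # precision parameter, kept positive
    # beta distribution reparameterized by mean mu and precision phi
    return -beta_dist.logpdf(y, mu * phi, (1 - mu) * phi).sum()

result = minimize(neg_log_likelihood, np.zeros(X.shape[1] + 1), method="L-BFGS-B")
predictions = expit(X @ result.x[:-1])                      # guaranteed to lie in (0, 1)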

Regression Models For Ordinal Data
Ordinal data regression is more straightforward. It is still a type of regression analysis, but one that is popular in the social sciences, where responses come as ordered options, such as a rating from 1 to 5, with 1 being very poor and 5 excellent. Modeling such a problem is more complex, as it requires learning the thresholds between classes as well as modeling the target itself.
The final predictions are probabilities of belonging to each (ordered) class or, if a probability-weighted mean over all classes is taken, a continuous value similar to a regression prediction, as in the sketch below.
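As a sketch of what that looks like in practice, the following assumes statsmodels >= 0.13 (for OrderedModel) and uses synthetic stand-in data; it fits a proportional-odds (ordered logit) model, then collapses the class probabilities into a continuous prediction with a weighted mean over Numerai's five target levels.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)))                # stand-in features
latent = X @ rng.normal(size=5) + rng.normal(size=1000)
codes = pd.cut(latent, bins=5, labels=False)                # five ordered buckets, 0..4
y = pd.Series(pd.Categorical(codes, categories=range(5), ordered=True))

# ordered logit: learns the coefficients plus the thresholds between classes
result = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)

probs = np.asarray(result.predict(X))                       # P(class k) per row, shape (n, 5)
class_values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])        # Numerai's target levels
predictions = probs @ class_values                          # probability-weighted mean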

Regression with a Logistic Link Function
This can be done by implementing the loss function by hand or through a library: XGBoost supports it out of the box if you specify reg:logistic as the objective function, and other libraries like Keras can support similar behavior as well.
This is clearly the easiest of the three to implement, and the easiest to interpret; see the sketch below.
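A minimal sketch with xgboost's built-in reg:logistic objective, on synthetic data and with illustrative hyperparameters:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # stand-in features
y = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=1000)      # Numerai-style targets

# reg:logistic squashes the model output through a sigmoid, so predictions
# stay inside (0, 1); the targets must already lie in [0, 1]
model = xgb.XGBRegressor(objective="reg:logistic", n_estimators=200, max_depth=4)
model.fit(X, y)
predictions = model.predict(X)                              # values in (0, 1)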


Beta regression sounds like an interesting variant to try. I’m trying to wrap my head around it, especially combined with this part of the “analysis_and_tips” Python notebook, which (if I get this right) says that models trained on the two extreme classes versus the rest both generate more or less the same result, while models trained on any of the middle classes versus the rest generate results that are alike but negatively correlated with the models trained on the extreme classes.

The first and last class are highly correlated

import numpy
import matplotlib.pyplot as plt

# "logistic", "df", and "features" are defined in earlier cells of the notebook
corrs = numpy.corrcoef(logistic.predict_proba(df[features]).T)  # class-vs-class correlations
plt.imshow(corrs, vmin=-1, vmax=1, cmap="RdYlGn")  # green = positive, red = negative
corrs

[heatmap of the correlations between the five class-probability columns]

Maybe beta regression, rather than logistic regression, changes that picture. Interesting thought. Alternatively, maybe there is some way to break the tournament up into five class-vs-other-classes problems and put the results back together, as in the sketch below. The whole “extremes are correlated” pattern reminds me of my badtimes and goodtimes models, which are also almost p/1-p :-). Have to think more.
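For what it’s worth, here is a rough sketch of that recombination idea with scikit-learn, on synthetic data: one binary logistic model per class versus the rest, with the resulting probabilities renormalized and averaged back into a continuous prediction. The class values 0, 0.25, …, 1 are assumed to match Numerai’s target levels.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # stand-in features
y = rng.integers(0, 5, size=1000)                           # five target buckets

# one binary model per class: P(class == k) vs the rest
probs = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int)).predict_proba(X)[:, 1]
    for k in range(5)
])
probs /= probs.sum(axis=1, keepdims=True)                   # rows sum to 1 again

class_values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
predictions = probs @ class_values                          # put the pieces back together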