I would like to share some of my explorations in the seemingly infinite field of Machine Learning algorithms that could be applied to the Numerai tournament. I shared this answer on Stack Exchange so now I’m sharing it with you guys. I encourage others to share and discuss in this thread interesting approaches to modeling the Numerai dataset.
My intention was to not use a regular regression loss objective to model the Numerai dataset which works well. I was interested in an approach that sees these values between 0-1 as probabilities that an observation belongs to a class.
In short, it represents y as distribution of probabilities of a target belonging to a class (or any other event). The link function for this regression restricts y^∈[0,1]. Interestingly, it doesn’t work if y = 0 or y = 1; two values present in the Numerai dataset.
Because the values 0 and 1 are in the Numerai dataset, the beta distribution is not the best representation of the Numerai dataset target, however, by approximating the zeroes to 0.01 and ones to 0.99 (or other similar values) the algorithm will be able to learn parameters for its model, in a hackish kinda way.
Keep in mind that the closest the value replacement is to either 0 or 1, the most biased towards the extremes the data points will represent.
Regression Models For Ordinal Data
Ordinal data regression is more straightforward. It’s still a type of regression analysis but this one is popular in social sciences where there are options that can be ordered such as a rating from 1 to 5, 1 being very poor and 5 being excellent. Modeling such a problem is complex and requires the learning of thresholds as well as the modelling of the target.
The final predictions are odds of belonging to an (ordered) class or, if a weighted mean is used for all probabilities, a continuous target similar to a regression prediction.
Regression with a Logistic Link Function
It’s easier if you implement the loss function by hand than if you use a library. Xgboost supports the inclusion if you specify reg:logistic as the objective function. Other libraries like Keras can support similar behavior as well.
This is clearly the easiest one to implement and interpret results.