Learning to Rank

I’d mentioned this on OHWA #12 yesterday, and @arbitrage suggested that I post the idea here. The idea is as follows: It is perhaps worth taking a step back and rethinking the tournament as a learning to rank problem rather than a regression problem.

The metric we’re trying to optimize for is a ranking metric which is scale invariant, and the only constraint is that the predicted targets are within the interval [0, 1]. With that in mind, we don’t need to optimize for squared error, absolute error or something in between the two. The only requirement is that the predicted targets are co-monotonic with the labels. This in my opinion, opens up the design space for loss functions and learning algorithms in general.

Also, I’ve empirically found that although a regression based loss function helps with performance initially. But once the learner converges sufficiently, a regression based loss actively hurts the learner’s performance. I think it’s because memorization (which the regression metric optimizes for) is good, until it isn’t. This also explains why it’s so easy to overfit the data. We try to alleviate this with regularization, but perhaps we could do better than just that.

Traditionally, in learning to rank literature, the problem is usually expressed in 3 ways.

  1. Pointwise ranking
  2. Pairwise ranking
  3. Listwise ranking

Treating the problem as a pointwise ranking problem is essentially treating it as a classification/regression problem. In LTR benchmarks, pairwise ranking almost always beats pointwise ranking. The intuition behind this is that comparing a pair of datapoints is easier than evaluating a single data point. Also, the learner has access to two sets of features to learn from, rather than just one.

The XGBoost Python API comes with a simple wrapper around its ranking functionality called XGBRanker, which uses a pairwise ranking objective. catboost and lightgbm also come with ranking learners. I’ve added the relevant snippet from a slightly modified example model to replace XGBRegressor with XGBRanker. This model does almost as well as the example model on the usual battery of metrics. Tuning hyperparameters to make it better than the example model is left as an exercise to the reader. Listwise ranking could probably be the topic for a future discussion.

Can anyone think of a way to get this to beat the example model? Perhaps combine this with the other recent ideas of era-boosting and feature neutralization.

    model = XGBRanker(max_depth=5, learning_rate=0.01, n_estimators=2000, n_jobs=-1, colsample_bytree=0.1)
    cdf = training_data.groupby('era').agg(['count'])
    group = cdf[cdf.columns[0]].values
    del cdf
    model.fit(training_data[feature_names], training_data[TARGET_NAME], group=group)

Edits: Fixed a typo: s/Treating the problem as a pairwise ranking problem …/Treating the problem as a pointwise ranking problem is essentially treating it as a classification/regression problem/


I have done some experiments with what I call head-to-head models (pairwise comparisons) and while it worked, so far I haven’t gotten it to work quite as well as…not doing that. I do use decision trees, but none of the typical boosting libraries so they may implement it in a more clever way than I do, and I do have a few ideas for next time I try it.

@wigglemuse When you say head-to-head models, do you mean Siamese nets?

And yes, metric learning (with contrastive loss) and pairwise ranking are very closely related.

I’m using decision trees, so they aren’t NNs, and if there is a special name for them I don’t know. I just mean instead of taking a bunch of rows of the data as given (more or less) and predicting that, I make a dataset where each row contains two original rows appended together (so 620 features instead of 310) or maybe one row subtracted from another. And then the target is 1/0 and you are predicting which row is greater than the other (you’d only compare rows that did not have equal targets originally). I used to do this with literal horse racing models, i.e. instead of “who is going to win the race?” I’d break it down into “is this horse going to beat that one?” and then it becomes a one-on-one comparison / sorting problem.


That’s quite neat! Your predictor will likely do much better if its output is expressed as a trichotomy. Quite like how comparator functions work in sorting algorithms.

Adding the equality comparison just seemed to increase the training burden with no benefit, but I haven’t really taken a deep dive into the method as it looked like it would need some improvements just to get even performance-wise with the more usual methods.

CatBoost GPU example code:

params = {
    'loss_function': 'PairLogit:max_pairs=100000',
    'task_type': 'GPU',
    'devices': '0'
m = CatBoost(params)
# traind = numerox Data object for training
m.fit(traind.x, traind.y['kazutsugi'], group_id=traind.era_float.astype(int))

I was torturing my 1080Ti for the whole night but I am not really able to find any models significantly better than I’ve already got using RMSE loss (yet). Exploring both normal & target neutralized possibilities.

One thing is clear though - pairwise mode requires much more GPU power (I limit amount of pairs to 100k per era, but still) than pointwise mode.

Maybe I will try listwise mode as well but it seems to run even slower.

1 Like

Yeah, you get into the problem where theoretically you’d want to compare every row with every other row in your dataset, and so that becomes infeasible and you have to subsample from all those possibilities but in a way that you can still get a reliable ranking for each original row.

1 Like

LTR is often used in information retrieval. There’s a query and multiple documents and the documents need to be ranked. The ‘Y’ is the labelled relevance score assigned to a feature vector (constructed using q and d). I’m having a hard time relating this to the numerai dataset.
In the code you’ve grouped by eras that means era==query, features are being ranked according to target scores (not relevance scores). None of this matches with the earlier context.
I may have misunderstood some details, could you elaborate on this further.

i am currently looking at locality sensitive hashing as a kind of nearest neighbour in high dimension in a scalable way (https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134)…

Not yet sure if it really helps though…