Differentiable Spearman in PyTorch (Optimize for CORR directly)

According to a GitHub issue:

At the moment we do not have a GPU implementation of the projection operators, which is the cause for the error. We decided not to do this conversion implicitly as we want the user to be aware that a device copy is necessary. If you want that behavior, can you write a small util function like

from fast_soft_sort import pytorch_ops

def soft_sort(array):
    return pytorch_ops.soft_sort(array.cpu()).cuda()

This solution worked fine for me. Having to perform the operation on CPU is not ideal, but it didn’t seem to hurt training times too badly.


Update: I am currently working on a pure PyTorch implementation, GitHub - teddykoker/torchsort: Fast, differentiable sorting and ranking in PyTorch, of the differentiable sorting and ranking algorithm, which will allow for faster computation directly on the GPU. It currently works, but is slower than the original (I believe my C++ isotonic regression solver needs some optimization, as it is slower than the Numba-JIT'd Python). Edit: got an optimized version working, should be out soon. (GPU kernels are a pain :roll_eyes:)


Has anyone pulled this off for TensorFlow? I would assume it shouldn’t be much work to convert the existing PyTorch example.

I am just starting (I submitted my first today), so I have a lot of work to do on my model.

Torchsort is now on par with PyTorch’s built-in sort for speed (but is differentiable). The Numerai Spearman loss can be implemented like:

import torchsort

def corrcoef(target, pred):
    pred_n = pred - pred.mean()
    target_n = target - target.mean()
    pred_n = pred_n / pred_n.norm()
    target_n = target_n / target_n.norm()
    return (pred_n * target_n).sum()


def spearman(
    target,
    pred,
    regularization="l2",
    regularization_strength=1.0,
):
    pred = torchsort.soft_rank(
        pred,
        regularization=regularization,
        regularization_strength=regularization_strength,
    )
    return corrcoef(target, pred / pred.shape[-1])
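
For reference, a minimal usage sketch (shapes and variable names are illustrative only; torchsort expects 2-D input of shape (batch, n)):

import torch

# one "era" of 5000 rows, purely for illustration
pred = torch.randn(1, 5000, requires_grad=True)
target = torch.rand(1, 5000)

# spearman() above returns the correlation, so negate it (or use 1 - corr)
# to get something to minimize
loss = -spearman(target, pred, regularization_strength=1.0)
loss.backward()  # gradients flow back through the soft ranking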

I implemented the CUDA kernel as well; it scales pretty well with batch size and sequence length.


Thanks for sharing @teddykoker. I have been meaning to implement fast-soft-sort for a few months now, so this should help. I will play around with it and try to optimize on corr/Sharpe/custom functions over the weekend and see what comes out of it. Did you run into any memory limitations while implementing it?

Thank you for your reply. I could run soft_sort on my Colab.

Look at the source code. It already has a TensorFlow implementation of fast-soft-sort: GitHub - google-research/fast-soft-sort: Fast Differentiable Sorting and Ranking
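
For a TensorFlow version of the same Spearman loss, something along these lines should work (a sketch only; it assumes fast_soft_sort.tf_ops.soft_rank mirrors the pytorch_ops signature, so check against the repo):

import tensorflow as tf
from fast_soft_sort import tf_ops  # assumed module path; see the repo

def tf_spearman(target, pred, regularization_strength=1.0):
    # soft ranks of the predictions; inputs are shaped (batch, n)
    pred = tf_ops.soft_rank(pred, regularization_strength=regularization_strength)
    pred_n = pred - tf.reduce_mean(pred, axis=-1, keepdims=True)
    target_n = target - tf.reduce_mean(target, axis=-1, keepdims=True)
    pred_n = pred_n / tf.norm(pred_n, axis=-1, keepdims=True)
    target_n = target_n / tf.norm(target_n, axis=-1, keepdims=True)
    return tf.reduce_sum(pred_n * target_n, axis=-1)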

I put this into a function that can be passed to xgboost as an objective function (see Custom Objective and Evaluation Metric — xgboost 1.5.0-SNAPSHOT documentation). This is a proof of concept and has not produced any meaningful results for me yet, but in case anyone is interested, here is the (messy) code:

import numpy as np
import torch
import xgboost
from fast_soft_sort.pytorch_ops import soft_rank

def corrcoef(target, pred):
    # np.corrcoef in torch from @mdo
    # https://forum.numer.ai/t/custom-loss-functions-for-xgboost-using-pytorch/960
    pred_n = pred - pred.mean()
    target_n = target - target.mean()
    pred_n = pred_n / pred_n.norm()
    target_n = target_n / target_n.norm()
    return (pred_n * target_n).sum()


def spearman(
    target,
    pred,
    regularization="l2",
    regularization_strength=1.0,
):
    
    pred = soft_rank(
        pred,
        regularization=regularization,
        regularization_strength=regularization_strength,
    )
    return corrcoef(target, pred / pred.shape[-1])


def custom_loss(ytrue, ypred):
    lenypred = ypred.shape[0]
    lenytrue = ytrue.shape[0]

    ypred_th = torch.tensor(ypred.reshape(1, lenypred), requires_grad=True)
    ytrue_th = torch.tensor(ytrue.reshape(1, lenytrue))

    loss = spearman(ytrue_th, ypred_th, regularization_strength=3)
    print(f'Current loss:{loss}')

    # calculate gradient and convert to numpy
    loss_grads = torch.autograd.grad(loss, ypred_th)[0]
    loss_grads = loss_grads.detach().numpy()

    # return gradient and ones instead of Hessian diagonal
    return loss_grads[0], np.ones(loss_grads.shape)[0]


params["objective"] = custom_loss
model = xgboost.XGBRegressor(**params).fit(X, y)

As @teddykoker mentioned, tuning the regularization might be one of the most important aspects. That said, I am still not convinced that it is actually useful to use Spearman as the objective function, but it is definitely worth a try.
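
For intuition on regularization_strength, a quick check like this (just a sketch that prints the soft ranks at a few strengths) shows the behavior: small strengths give something close to the hard ranks, while large strengths smooth everything toward the mean rank.

import torch
import torchsort

x = torch.tensor([[0.10, 0.90, 0.30, 0.70]])
for s in (1e-3, 1e-1, 1.0, 10.0):
    print(s, torchsort.soft_rank(x, regularization_strength=s))
# as strength -> 0 the output approaches the exact ranks [1, 4, 2, 3];
# as strength grows it collapses toward the mean rank 2.5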

Has anyone achieved good correlation using the differentiable Spearman custom objective?
I tried it with LightGBM, and it was a disaster…

Here are my Optuna study trials. The vertical axis is the differentiable Spearman score.

[image: Optuna study trial scores]

Which code did you try?

I mixed @paulito's and @teddykoker's code for the LightGBM objective, as below.

import numpy as np
import torch
import torchsort

def spearman_loss_lgb(ytrue, ypred):
    
    def corrcoef(target, pred):
        pred_n = pred - pred.mean()
        target_n = target - target.mean()
        pred_n = pred_n / pred_n.norm()
        target_n = target_n / target_n.norm()
        return (pred_n * target_n).sum()

    def differentiable_spearman(target, pred, regularization="l2", regularization_strength=1.0,):
        pred = torchsort.soft_rank(
            pred,
            regularization=regularization,
            regularization_strength=regularization_strength,
        )
        return corrcoef(target, pred / pred.shape[-1])
    
    lenypred = ypred.shape[0]
    lenytrue = ytrue.shape[0]

    ypred_th = torch.tensor(ypred.reshape(1, lenypred), requires_grad=True)
    ytrue_th = torch.tensor(ytrue.reshape(1, lenytrue))

    loss = differentiable_spearman(ytrue_th, ypred_th, regularization_strength=1e-2)
    # print(f'Current loss:{loss}')

    # calculate gradient and convert to numpy
    loss_grads = torch.autograd.grad(loss, ypred_th)[0]
    loss_grads = loss_grads.to('cpu').detach().numpy()

    # return gradient and ones instead of Hessian diagonal
    return loss_grads[0], np.ones(loss_grads.shape)[0]
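
The objective can then be passed to LightGBM's sklearn API, for example (a sketch; X_train and y_train are placeholders):

import lightgbm as lgb

# the sklearn wrapper accepts a callable objective with
# signature (y_true, y_pred) -> (grad, hess)
model = lgb.LGBMRegressor(objective=spearman_loss_lgb, n_estimators=500)
model.fit(X_train, y_train)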

And by disaster you mean the huge swings in the score?
Maybe it is just me, but it looks like it is converging… have you tried simply running 300 more trials?

Terribly stupid question: is this a loss, or do we have to return -corrcoef(…) when we want to use it as a loss function?

At the risk of exposing a complete lack of understanding, I hope someone here can clear up a little confusion I have regarding torchsort.soft_rank().

I have been using this in pytorch, with some success, but I’m wondering if I may still be implementing the loss incorrectly and simply getting lucky.

I notice in the example docs that torch.autograd.grad() is used to compute the gradient. What I do not understand is whether this is needed for a full torch implementation or if this is being done for people to extract the gradient and use in another tool set, such as XGB.

In a pure PyTorch module, if I calculate the correlation loss and then call loss.backward(), is there really any need for me to extract the gradient myself?

Here is my code:

import torchsort
def t_corrcoef(target, pred):
    pred_n = pred - pred.mean()
    target_n = target - target.mean()
    pred_n = pred_n / pred_n.norm()
    target_n = target_n / target_n.norm()
    return (pred_n * target_n).sum()

#
# FUNCTION: t_spearman()
#   - to calculate differentiable spearman corr for torch training
#
def t_spearman(target, pred, regularization="l2", regularization_strength=1.0):
    # fast_soft_sort uses 1-based indexing; divide by len to get percentage-of-rank

    # move predictions to CPU for soft_rank, then back to the target's device
    pred = torchsort.soft_rank(
        pred.cpu(),
        regularization=regularization,
        regularization_strength=regularization_strength,
    ).to(target.device)
    return t_corrcoef(target, pred / pred.shape[-1])

so my understanding is that calculating corr_loss would follow:

corr = t_spearman(batch['Y'].unsqueeze(0), preds.unsqueeze(0), regularization="l2", regularization_strength=1.0)
loss = 1.0 - corr
loss.backward()

Is this correct? If so, might there be a way to apply sample-based weights to the loss, similar to using torch.nn.MSELoss(reduction='none'), in order to give a greater penalty to certain samples based on a non-uniform distribution?

Hi @teddykoker, thank you very much for your code. I was trying to use it in a custom loss function for TabNet, but I hit an issue that took me some time to debug: your function expects a tensor of shape (1, X) while TabNet passes one of shape (X, 1), so I had to reshape them. The function uses the default regularization ("l2") and seems to work fine.

import torchsort

def spearman(pred, target):

    x = 1e-2
    pred = torchsort.soft_rank(pred.reshape(1,-1),regularization_strength=x)
    target = torchsort.soft_rank(target.reshape(1,-1),regularization_strength=x)
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()

    return (pred * target).sum()

In case someone also needs a metric, this one should work:

import torch
from pytorch_tabnet.metrics import Metric

class Sprme_Metric(Metric):
    """
    sprme.
    """

    def __init__(self):
        self._name = "sprme" # write an understandable name here
        self._maximize = True

    def __call__(self, y_true, y_score):
        """
        Compute Spearman Correlation of predictions.

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_score: np.ndarray
            Score matrix or vector

        Returns
        -------
            float
            Spearman of predictions vs targets.
        """
        return spearman(torch.from_numpy(y_score), torch.from_numpy(y_true)).item()

I have a bit of a Hamlet-style doubt: in regression we should try to minimise the loss. In that case the loss should return 1 - ret instead of just ret. Since the Spearman index goes from -1 to +1, 1 - ret will be zero when the index is +1 (what we want, perfect positive correlation) and at its maximum when the index is -1 (which we don’t want, perfect negative correlation). So to minimise the loss we should find an index close to +1.
Am I wrong? :thinking:
In that case the metric should return 1 minus the loss (i.e. the correlation itself): we want to maximise the metric, and that value is maximal when the loss is zero (index = +1, what we want) and minimal when the loss equals two (index = -1, which we don’t want). Otherwise we can simply return the amended loss (1 - correlation) directly and set _maximize = False.
What do you think?

Here is the revised code:

import torchsort

def spearman(pred, target):

    x = 1e-3
    pred = torchsort.soft_rank(pred.reshape(1, -1), regularization_strength=x)
    target = torchsort.soft_rank(target.reshape(1, -1), regularization_strength=x)
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()
    # return 1 - corr so that minimising the loss drives the correlation toward +1
    ret = 1 - (pred * target).sum()
    return ret

In my case x = 1e-3 gave better results. And for the metric (the simplest form):

import torch
from pytorch_tabnet.metrics import Metric

class Sprme_Metric(Metric):
    """
    sprme.
    """

    def __init__(self):
        self._name = "sprme" # write an understandable name here
        self._maximize = False

    def __call__(self, y_true, y_score):
        """
        Compute 1 - Spearman correlation of predictions (used as a loss-style metric).

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_score: np.ndarray
            Score matrix or vector

        Returns
        -------
            float
            1 - Spearman correlation of predictions vs targets (lower is better).
        """
        return spearman(torch.from_numpy(y_score), torch.from_numpy(y_true)).item()
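
These might be wired into TabNet roughly as follows (a sketch; it assumes pytorch-tabnet's fit() accepts loss_fn and eval_metric as shown, with dummy data used only to illustrate the shapes):

import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

# dummy data, only to show the wiring
X_train = np.random.rand(1000, 20).astype(np.float32)
y_train = np.random.rand(1000).astype(np.float32)
X_val = np.random.rand(200, 20).astype(np.float32)
y_val = np.random.rand(200).astype(np.float32)

model = TabNetRegressor()
model.fit(
    X_train, y_train.reshape(-1, 1),           # TabNetRegressor expects 2-D targets
    eval_set=[(X_val, y_val.reshape(-1, 1))],
    eval_metric=[Sprme_Metric],                 # custom Metric class from above
    loss_fn=spearman,                           # the 1 - corr loss from above
    max_epochs=5,
)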

I am new and have very little concrete to go on. I only have stable submissions for rounds 253 and 254. My feature-neutralized training models have the best validation by a bit, so they are weighted a bit more, but that is not all of the models. I also still had FE, so I have post-process neutralized some of them as well. Across 3 models and 2 rounds, all are positive overall and on the day, if that answers your question.

They are basically all the same, though, so I can see the nuances of live: one is an optimized blend with 0.5 post-process neutralization, one is an optimized blend with no post-process neutralization, and one is an even blend with 0.5 post-process neutralization.

Today was a different day for sure, though. 254 is very positive for me but still completely crushed by integration_test and linear models. Also, my non-post-processed model is outperforming the other two, which it is not doing on 253.

Really I have very little to go on so far. And today is the first day of 254 so going to be volatile.


Is there feature neutralization code for PyTorch?
I did some trials but never got it to work. The furthest I got was finding that there is no GPU implementation for least squares.

There is code for neutralization with PyTorch in this thread by @mdo (in the training loop, using PyTorch's pinverse).
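
The gist of that approach looks roughly like this (my own sketch of standard linear neutralization via the pseudo-inverse, not @mdo's exact code; the proportion argument and tensor shapes are assumptions):

import torch

def neutralize(preds, features, proportion=1.0):
    # preds: (n,) predictions; features: (n, f) feature matrix
    preds = preds - preds.mean()
    # component of the predictions explained by the features
    # (least-squares projection via the pseudo-inverse)
    exposure = features @ (torch.pinverse(features) @ preds)
    neutralized = preds - proportion * exposure
    return neutralized / neutralized.std()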


Hi Teddy, training models for corr is interesting, but have you compared model performance vs. training for the cyrus_v4_20 target? I'd think that a model that can predict cyrus_v4_20 should also rank well for corr20v2. The hiccup I've been running up against is that sometimes high corr20v2 performance results in negative TC. I'd love to hear your thoughts on this.