Optimizing for FNC and TB scores

Hi @mdo and @jrb, I have a question about calculating exposure.

Earlier in this thread @mdo uses this:

And in a different thread (Model Diagnostics: Feature Exposure) @jrb used something a little different:

Both increase as exposure increases, and both are zero when exposure is zero. But to me, @jrb’s version has the nice additional property of being bounded between 0 and 1.0 no matter how many features are being considered.
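(The quoted snippets didn’t come through above, so purely as a hypothetical illustration of the property being discussed: compare an L2-norm summary of per-feature correlations with a root-mean-square one, where only the latter stays in [0, 1] regardless of how many features there are.)

```python
import numpy as np

def per_feature_corrs(preds: np.ndarray, features: np.ndarray) -> np.ndarray:
    # correlation of the predictions with each feature column
    return np.array([np.corrcoef(preds, features[:, i])[0, 1]
                     for i in range(features.shape[1])])

def exposure_l2(preds, features):
    # zero iff every exposure is zero, but the upper bound grows with sqrt(n_features)
    return float(np.sqrt(np.sum(per_feature_corrs(preds, features) ** 2)))

def exposure_rms(preds, features):
    # also zero iff every exposure is zero, and always in [0, 1]
    return float(np.sqrt(np.mean(per_feature_corrs(preds, features) ** 2)))
```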

Are there other reasons you might choose one over the other? Such as accuracy, or speed, or … ?

Thanks,

prc

Yes, “penalization” of features is part of the optimizer. We allow some feature exposure, but not a lot. Actually, the reason we gave FNCv3 is because those are the features the optimizer is penalizing. But there is obviously a lot else going on in the optimizer which affects TC.
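For intuition, here is a rough sketch of what proportional feature penalization could look like (assuming a simple least-squares projection; this is not the optimizer’s actual code):

```python
import numpy as np

def penalize(preds: np.ndarray, features: np.ndarray, proportion: float = 0.5) -> np.ndarray:
    # Subtract `proportion` of the least-squares projection of the predictions
    # onto the feature space. proportion=1.0 would be full neutralization;
    # smaller values only dampen feature exposure, leaving some of it in place.
    projection = features @ (np.linalg.pinv(features) @ preds)
    return preds - proportion * projection
```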


Why this might be important from a user perspective: it seems (to me anyway) that since we can still bet on CORR as well as TC, we can unquestionably get higher CORR on average with less-neutralized predictions, while neutralization may help with TC. But if the neutralization is happening anyway as part of the process, then maybe we can get away with submitting unneutralized predictions, i.e. we don’t have to do that neutralization ourselves. So if submitting unneutralized predictions and submitting predictions neutralized to the FNCv3 set are substantially the same in terms of the resulting TC scores, then unneutralized is the way to go, because that will generally get higher CORR results. Whether that’s actually true (neutralized vs. unneutralized getting more or less equal TC) depends on the order of operations in the whole TC/optimizer process, I suppose.

I think that’s a fine approach. I think you can reduce the TC of a model with feature neutralization, because the optimizer does some penalization rather than full neutralization, so some exposure to features will help if those features work on live data (especially ones the Meta Model is not already exposed to). MDO showed in the past that the optimal level of feature neutralization did not appear to be 100%. I think it’s good to have models with high FNC and high CORR. It’s models with super high CORR due to one huge feature exposure that the Meta Model already has that can do very well on CORR in some rounds but get very negative TC.

@mdo, you are centering the features at 0 but not the targets (or I don’t see where). What would be the reason not to apply the -.5 to the targets as well?

The loss is correlation-based and centers the target as part of the formula.
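In other words, something like this minimal sketch of a correlation-style loss (not the exact loss used above), where subtracting 0.5 from the target beforehand would change nothing:

```python
import torch

def corr_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Pearson-style correlation: both arguments are centered inside the formula,
    # so any constant shift of the target cancels out.
    p = pred - pred.mean()
    t = target - target.mean()
    return -(p * t).sum() / (p.norm() * t.norm() + 1e-12)
```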


I didn’t like the idea of neutralizing on every backprop step, because neutralization is very slow on my PC. So I thought: what would happen if I neutralized the target instead of neutralizing the predictions? I neutralized only the target in the training data and left the validation target unchanged. I didn’t train the models fully, so it is possible that they can flip later in training.
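A minimal sketch of what per-era target neutralization could look like (a hypothetical helper, assuming an `erano` column and a plain least-squares fit, as in the training loops later in the thread):

```python
import numpy as np
import pandas as pd

def neutralize_target(df: pd.DataFrame, feature_cols, target_col: str = "target") -> pd.Series:
    # Per era, replace the target with the residual of a least-squares fit on
    # the features, so the model cannot score simply by relearning linear
    # feature exposures.
    pieces = []
    for _, era_df in df.groupby("erano"):
        X = era_df[feature_cols].values - 0.5
        y = era_df[target_col].values - era_df[target_col].values.mean()
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        pieces.append(pd.Series(y - X @ beta, index=era_df.index))
    return pd.concat(pieces)
```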

Validation [edited version, hopefully without the bug]:
[screenshot of validation results, 2022-04-29]

Version of data: V4
Validation data: eras that were validation eras in V3
Training data: all eras minus the validation eras
Loss: mean erawise rank correlation
Number of iterations: 1000 (low, I usually train 20000+)
Model: LGBM

@mdo in your code you have:

rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
# change pred to uniform distribution
pred = (rr - .5)/rr.shape[1]

However, this assumes that rr returns ranks from 0…size-1. After installing torchsort and trying a couple of times, I was surprised to see that soft_rank returns a ranking that does not necessarily start at 0.

Check the following tests:


import pytest
import torch
from torchsort import soft_rank, soft_sort


def test_less_than_one_numbers():
    z = torch.tensor([[0.4385, 0.4385, 0.4385, 0.5649]])
    ranked = soft_rank(z)
    print(ranked)
    assert ranked.min() == 0


def test_bigger_than_one_numbers():
    z = torch.tensor([[5000, 10, 20, 34, ]])
    ranked = soft_rank(z)
    print(ranked)
    assert ranked.min() == 0

    ranked = soft_rank(torch.tensor([[5000, 5000, 10, 20, 5000, 34, 10, 20, 34, ]]))
    print(ranked)
    assert ranked.min() == 0

def test_mix_big_small_numbers():
    z = torch.tensor([[5000, 10, 0.01, 0.4385, 0.5649, 20, 34, ]])
    print(soft_rank(z))
    ranked = soft_rank(z)
    assert ranked.min() == 0

This makes the correlation unreliable, I think. Can you tell me exactly which torchsort library you used?
I’m using the torchsort package from PyPI for these tests.

Also, I’m not sure the line `pred = (rr - .5)/rr.shape[1]` behaves as intended, given the above.

Any help understanding all this is greatly appreciated.

The output of soft_rank depends on the scale of the input. You need to adjust the regularization_strength parameter to make it give sensible results for the scale of your input data.
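A small demonstration of that scale dependence (a sketch using the PyPI torchsort package mentioned above):

```python
import torch
from torchsort import soft_rank

x = torch.tensor([[0.4385, 0.4385, 0.4385, 0.5649]])

# With the default regularization_strength=1.0, values this close together are
# heavily smoothed and the soft ranks collapse toward the mean rank.
print(soft_rank(x))

# A regularization_strength that is small relative to the spread of the inputs
# pushes the output toward a hard ranking (1..n for distinct values).
print(soft_rank(x, regularization_strength=1e-3))

# Rescaling the inputs has a similar effect.
print(soft_rank(x * 1000))
```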

Thank you, you are right. I have performed more experiments to see the effect of regularization_strength. My conclusion is that while tuning regularization_strength brings the output closer to a hard ranking, it doesn’t guarantee one. On the contrary, there are cases where two things happen: the starting value of the soft ranking is well above 0, and the difference between consecutive values is not 1.

With that in mind I have the following comments:

pred = (rr - .5)/rr.shape[1]  

rr starts at an arbitrary value between 0 and len(pred), so subtracting .5 doesn’t make sense
dividing by rr.shape[1] does restrict the range to 0…1


    if tb is not None:
        tbidx = torch.bitwise_xor(rr<=tb, rr > (rr.shape[1]-tb))  ## problem

rr is a soft ranking, so we cannot rely on it starting at 0 and increasing by 1. Therefore the masking is not necessarily working.
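One way to make the top/bottom mask independent of the scale of the soft ranks (a sketch, not from the original code) is to derive hard ranks just for the mask, since the mask itself needs no gradient:

```python
import torch
import torchsort

def soft_rank_with_tb_mask(pred, tb, regularization_strength=1.0):
    # soft ranks feed the differentiable loss
    rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
    # hard ranks (1..n regardless of input scale) are used only for selecting
    # the top and bottom tb entries
    hard = torch.argsort(torch.argsort(pred, dim=1), dim=1) + 1
    tbidx = (hard <= tb) | (hard > pred.shape[1] - tb)
    return rr, tbidx
```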


Hi,

After reading the post, I thought it could also be interesting to add feature dissimilarity to the loss calculation. As I’m not sure how to compute the dataframe’s .corrwith(…) with PyTorch, I implemented a very inefficient approach that cannot run on the GPU (it just uses numpy rather than PyTorch tensor operations).
Any feedback on the idea or how to implement it properly?
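As an aside, a column-wise equivalent of .corrwith(…) can be written directly in PyTorch (a minimal sketch, assuming `features` is an (n_rows, n_features) tensor and `preds` an (n_rows,) tensor):

```python
import torch

def corrwith_torch(features: torch.Tensor, preds: torch.Tensor) -> torch.Tensor:
    # Pearson correlation of each feature column with the prediction vector,
    # roughly df[feature_cols].corrwith(df['preds'])
    f = features - features.mean(dim=0, keepdim=True)
    p = preds - preds.mean()
    return (f * p.unsqueeze(1)).sum(dim=0) / (f.norm(dim=0) * p.norm() + 1e-12)
```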


# Assumes feature_cols, era_list, example_col, model, optimizer, epochs and a
# numerair_tb loss function are defined elsewhere.
import numpy as np
import torch

# center the features at 0
for f in feature_cols:
    train_data[f] -= 0.5

for epoch in range(epochs):
    np.random.shuffle(era_list)
    batch_count = 0
    acc_loss_train = 0
    for era in era_list:
        batch_count += 1

        # get features and target from data and put in tensors
        features = torch.tensor(train_data[train_data.erano == era].filter(like='feature').values)
        target = torch.tensor(train_data[train_data.erano == era]['target'].values)

        # zero gradient buffer and get model output
        optimizer.zero_grad()
        model.train()
        model_output = model(features)

        orig_loss = -numerair_tb(model_output, target)

        # dissimilarity (computed with pandas/numpy, so no gradient flows
        # through this term -- it only shifts the loss value)
        train_era = train_data[train_data.erano == era].copy()
        example_preds = train_era[example_col].values
        example_preds = (example_preds - np.mean(example_preds)) / np.std(example_preds)

        train_era['example_preds'] = example_preds
        train_era['preds'] = model_output.detach().numpy()

        u = train_era[feature_cols].corrwith(train_era['preds'])
        e = train_era[feature_cols].corrwith(train_era['example_preds'])
        dissimilarity = np.dot(u, e) / np.dot(e, e)

        # final loss
        loss = -orig_loss + torch.tensor(dissimilarity)

        acc_loss_train += loss.item()  # accumulate as a float so graphs aren't retained
        loss.backward()
        optimizer.step()

    loss_train = acc_loss_train / batch_count

I think I came up with an implementation that works on the GPU, since it uses PyTorch. However, after reading the post True Contribution Details, exposure dissimilarity seems to be relevant only when combined multiplicatively with FNCv3, so it might not make sense to use it on its own.
Any feedback is more than welcome!

# Assumes feature_cols, era_list, model, optimizer, epochs and a numerair_tb
# loss function are defined elsewhere, and that train_data contains an
# 'example_preds' column created beforehand.
import numpy as np
import torch

# center the features at 0
for f in feature_cols:
    train_data[f] -= 0.5

for epoch in range(epochs):
    np.random.shuffle(era_list)
    batch_count = 0
    acc_loss_train = 0
    for era in era_list:
        batch_count += 1

        # get features and target from data and put in tensors
        features = torch.tensor(train_data[train_data.erano == era].filter(like='feature').values)
        target = torch.tensor(train_data[train_data.erano == era]['target'].values)

        # zero gradient buffer and get model output
        optimizer.zero_grad()
        model.train()
        model_output = model(features)

        orig_loss = -numerair_tb(model_output, target)

        # dissimilarity, computed entirely in PyTorch so gradients flow through it
        train_era = train_data[train_data.erano == era]

        example_preds = torch.as_tensor(train_era['example_preds'].values)
        example_preds = example_preds - example_preds.mean()
        # per-feature correlation with the example predictions
        corr_example_preds = (features.T * example_preds).sum(dim=1) / ((features.T * features.T).sum(dim=1) * (example_preds * example_preds).sum()).sqrt()

        preds = model_output
        preds = preds - preds.mean()
        # per-feature correlation with the model's predictions
        corr_preds = (features.T * preds).sum(dim=1) / ((features.T * features.T).sum(dim=1) * (preds * preds).sum()).sqrt()

        # same ratio as the pandas/numpy version above: dot(u, e) / dot(e, e)
        # (pinverse expects a 2-D tensor, so plain dot products are used here)
        num = torch.dot(corr_preds, corr_example_preds)
        denom = torch.dot(corr_example_preds, corr_example_preds)

        dissimilarity = num / denom

        # final loss
        loss = -orig_loss + dissimilarity

        acc_loss_train += loss.item()  # accumulate as a float so graphs aren't retained
        loss.backward()
        optimizer.step()

    loss_train = acc_loss_train / batch_count