Optimizing for FNC and TB scores

You calculate the corr score per era, but I do not see a reason why you should predict in batches.


My question is whether it makes sense, at the end of each training epoch, to do an early-stopping check using validation data, looping over the validation eras like this:


def validation_early_stopping(val_data, model):
    model.eval()
    era_list = val_data.erano.unique()
    np.random.shuffle(era_list)  # order doesn't matter for validation, but harmless
    batch_count = 0
    acc_loss_val = 0  # accumulator should start at zero

    with torch.no_grad():
        for era in era_list:
            batch_count += 1
            # get features and target from data and put in tensors
            features = torch.tensor(val_data[val_data.erano == era].filter(items=feature_names).values) - .5
            target = torch.tensor(val_data[val_data.erano == era]['target'])
            features = features.cuda()
            target = target.cuda()

            output = model(features)

            # neutralize model output
            b = features.pinverse(rcond=1e-6) @ output
            linear_pred = features @ b
            neutralized_output = output - linear_pred

            neut_tb_loss, neut_tb_exp = numerai_r_tb_exposure(neutralized_output, target, features, tb=500)
            neut_loss = numerair_tb(neutralized_output, target)
            orig_loss, orig_exp = numerai_r_tb_exposure(output, target, features)

            loss = -neut_tb_loss - neut_loss - orig_loss + neut_tb_exp/1e3 + orig_exp/1e4

            acc_loss_val += loss

    loss_val = acc_loss_val / batch_count
    return loss_val.item()

As we are using TB500, I'm not sure if the size or the composition of the validation batches matters here, or if it's even conceptually correct to check early stopping like this in this case.
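For context, a minimal patience-based wrapper around that per-era validation loss might look like the sketch below; the patience value, the checkpoint path, and the train_one_epoch helper are my own assumptions, not part of the original code.

best_val_loss = float('inf')
patience, bad_epochs = 10, 0

for epoch in range(epochs):
    train_one_epoch(model, optimizer)                 # hypothetical training step
    val_loss = validation_early_stopping(val_data, model)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best_model.pt')   # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                     # stop training early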

Will there be an update soon for the Numerai tournament? I wasn't sure if something had already changed, but I didn't notice anything. I'm not sure where to look for big announcements like that.

Perhaps I misunderstand the meaning of TB500. I believe it refers to the top/bottom 500 prediction values. That is indeed a smaller subset than the full era; however, in the code it looks like the TB500 samples are used in their neutralized form as an addition to the full set of sample losses, effectively adding extra pressure on the top and bottom 500 to improve performance.

Unless I'm reading this code incorrectly (very possible), the neut_loss (effectively the corr for the entire set of neutralized predictions) and the orig_loss (the corr for the entire set of raw predictions) are being maximized, due to the "-" sign, when they are included in the final loss calculation. This is also where the additional loss terms from the TB500 subset come in.
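To make my reading concrete, this is roughly the selection I have in mind (a sketch assuming a hard 1..n ranking of the predictions, not the soft-rank code from the post):

import torch

def tb_mask(pred: torch.Tensor, tb: int = 500) -> torch.Tensor:
    # rows whose prediction rank is in the bottom tb or the top tb of the era
    n = pred.shape[0]
    rank = pred.argsort().argsort() + 1          # hard ranks 1..n
    return (rank <= tb) | (rank > n - tb)

def corr(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm())

# mask = tb_mask(neutralized_output, tb=500)
# corr(neutralized_output[mask], target[mask])   # TB500 term
# corr(neutralized_output, target)               # full-era term
# both enter the final loss with a "-" sign, i.e. they are maximized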

If I’m reading this wrong however, I would love a clear breakdown of the process.

Sounds like you’ve got it!

@mdo if Numerai is performing Feature Neutralization on our predictions before TC calculations, would it not help to know which features the team is neutralizing with? As Numerai is not aware of which features we may have used to generate predictions, how have they determined the best features to use in neutralization? Or do they use them all?

I see above a mention:

FNCv3 is the 420 features of the “medium” featureset they released.
On RC they announced that code for FNCv3 will be released soon

Is this information public somewhere? I can't find it in the 'Neutralization' section of the Docs.

Predictions are not neutralized before TC calculations, just Gauss-rank transformed.
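For anyone unfamiliar, a generic Gauss-rank transform looks roughly like this; a minimal sketch of what is typically meant by the term, not Numerai's actual code:

import numpy as np
from scipy.stats import norm

def gauss_rank(x: np.ndarray) -> np.ndarray:
    # rank the values, squeeze the ranks into (0, 1), then map them through
    # the inverse normal CDF so the result is approximately standard normal
    ranks = x.argsort().argsort() + 1
    uniform = (ranks - 0.5) / len(x)
    return norm.ppf(uniform)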


Isn’t neutralization part of the optimizer though?

Hi @mdo and @jrb, I have a question about calculating exposure.

Earlier in this thread @mdo uses this:

And in a different thread (Model Diagnostics: Feature Exposure) @jrb used something a little different:

Both go up as exposure goes up, and both bottom out at zero if exposure is zero. But to me, @jrb's version has the nice additional quality of being bounded between 0 and 1.0 no matter how many features are being considered.

Are there other reasons you might choose one over the other? Such as accuracy, or speed, or … ?

Thanks,

prc

Yes, "penalization" of features is part of the optimizer. We allow some feature exposure, but not a lot. Actually, the reason we gave FNCv3 is that those are the features the optimizer is penalizing. But there is obviously a lot else going on in the optimizer that affects TC.
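As an illustration of the difference between penalization and full neutralization, partial neutralization is often written with a proportion parameter; a sketch along the lines of the pinverse-based neutralization earlier in this thread, not the optimizer's actual code:

import numpy as np

def neutralize(preds: np.ndarray, features: np.ndarray, proportion: float = 0.5) -> np.ndarray:
    # project the predictions onto the span of the features and subtract only
    # a fraction of that linear component; proportion=1.0 is full neutralization
    beta = np.linalg.pinv(features, rcond=1e-6) @ preds
    return preds - proportion * (features @ beta)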


Why this might be important from a user perspective: it seems to me that since we can still bet on CORR as well as TC, we can unquestionably get higher CORR on average with less-neutralized predictions, while neutralization may help with TC. But if the neutralization is happening anyway as part of the process, then maybe we can get away with submitting unneutralized predictions, i.e. we don't have to do that neutralization ourselves. So if submitting unneutralized preds and submitting preds neutralized to the FNCv3 set turn out substantially the same in terms of the resulting TC scores, then unneutralized is the way to go, because that will get higher corr results (in general). Whether that's actually true (neutralized vs. unneutralized getting more or less equal TC) depends on the order of operations in the whole TC/optimizer process, I suppose.

I think that's a fine approach. I think you can reduce the TC of a model with feature neutralization: because the optimizer does some penalization rather than full neutralization, some exposure to features will help if those features work on live (especially ones the Meta Model is not already exposed to). MDO showed in the past that the optimal level of feature neutralization did not appear to be 100%. I think it's good to have models with high FNC and high CORR. It's the models with super-high CORR due to one huge feature exposure, which the Meta Model already has, that can do very well on CORR in some rounds but get very negative TC.

@mdo you are centering the features at 0 but not the targets (or I don't see where). What would be the reason not to apply the -.5 also to the targets?

The loss is correlation-based and centers the target as part of the formula.
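In other words, any constant offset of the target cancels inside a correlation-style loss, so shifting the target by -.5 would have no effect. A minimal sketch, not the exact loss from the post:

import torch

def pearson(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    pred = pred - pred.mean()
    target = target - target.mean()      # the centering happens here
    return (pred * target).sum() / (pred.norm() * target.norm())

# pearson(pred, target) == pearson(pred, target - 0.5) up to floating-point error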


I didn't like the idea of neutralizing on every backprop, because neutralization is very slow on my PC. So I thought: what would happen if I neutralized the target instead of neutralizing the predictions? I neutralized only the target in the training data and left the validation target unchanged (roughly as in the sketch further down in this post). I didn't train the models fully, so it is possible that they flip later in training.

Validation [edited version - without bug hopefully]:
[screenshot of validation results, 2022-04-29]

Version of data: V4
Validation data: eras that were validation eras in V3
Training data: all eras minus the validation eras
Loss: mean erawise rank correlation
Number of iterations: 1000 (low, I usually train 20000+)
Model: LGBM
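Roughly, the target neutralization I have in mind looks like this (a sketch; column names like erano and target are assumptions carried over from the earlier snippets):

import numpy as np
import pandas as pd

def neutralize_target(df: pd.DataFrame, feature_names, target_col='target') -> pd.Series:
    # per era, regress the target on the (centered) features and keep only
    # the residual as the new training target
    def _one_era(era_df):
        X = era_df[feature_names].values - 0.5
        y = era_df[target_col].values
        beta = np.linalg.pinv(X, rcond=1e-6) @ y
        return pd.Series(y - X @ beta, index=era_df.index)
    return df.groupby('erano', group_keys=False).apply(_one_era)

# train_data['target'] = neutralize_target(train_data, feature_names)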

@mdo in your code you have:

rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
# change pred to uniform distribution
pred = (rr - .5)/rr.shape[1]

However, this assumes that rr returns ranks from 0 to size-1. After installing torchsort and trying a couple of times, I was surprised to see that soft_rank returns a ranking that doesn't necessarily start at 0.

Check the following tests:


import pytest
import torch
from torchsort import soft_rank, soft_sort


def test_less_than_one_numbers():
    z = torch.tensor([[0.4385, 0.4385, 0.4385, 0.5649]])
    ranked = soft_rank(z)
    print(ranked)
    assert ranked.min() == 0


def test_bigger_than_one_numbers():
    z = torch.tensor([[5000, 10, 20, 34, ]])
    ranked = soft_rank(z)
    print(ranked)
    assert ranked.min() == 0

    ranked = soft_rank(torch.tensor([[5000, 5000, 10, 20, 5000, 34, 10, 20, 34, ]]))
    print(ranked)
    assert ranked.min() == 0

def test_mix_big_small_numbers():
    z = torch.tensor([[5000, 10, 0.01, 0.4385, 0.5649, 20, 34, ]])
    print(soft_rank(z))
    ranked = soft_rank(z)
    assert ranked.min() == 0

This makes the correlation unreliable, I think. Can you tell me exactly which torchsort library you used?
I'm using torchsort · PyPI for these tests.

Also, I have questions about this line:

pred = (rr - .5)/rr.shape[1]


Any help to understand all this is greatly appreciated.

The output of soft_rank depends on the scale of the input. You need to adjust the regularization_strength parameter to make it give sensible results for the scale of your input data.
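For example (my own quick illustration with the torchsort package, not from the original post), the same values give very different soft ranks depending on regularization_strength relative to the input scale:

import torch
import torchsort

x = torch.tensor([[0.4385, 0.4385, 0.4385, 0.5649]])

print(torchsort.soft_rank(x))                                 # default strength: ranks smoothed toward the mean
print(torchsort.soft_rank(x, regularization_strength=1e-3))   # much closer to the hard ranks 1..n
print(torchsort.soft_rank(x * 1000.0))                        # rescaling the input has a similar effect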

Thank you, you are right. I have performed more experiments to see the effect of regularization_strength.
My conclusion is that while adjusting regularization_strength brings the output closer to a hard ranking, it doesn't guarantee a hard ranking. On the contrary, there are cases where two things happen: the starting value of the soft ranking is >> 0, and the difference between consecutive values is != 1.

With that in mind I have the following comments:

pred = (rr - .5)/rr.shape[1]  

rr starts at an arbitrary value between 0 and len(pred), so subtracting .5 doesn't make sense.
Dividing by rr.shape[1] does restrict the range to roughly 0…1.


    if tb is not None:
        tbidx = torch.bitwise_xor(rr<=tb, rr > (rr.shape[1]-tb))  ## problem

rr is a soft ranking; we cannot rely on the ranking starting at 0 and increasing by 1. Therefore the masking does not necessarily work.
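One possible workaround (my own suggestion, not from the original post) is to standardize the predictions before soft-ranking, so that the chosen regularization_strength behaves consistently regardless of the scale of the predictions:

import torch
import torchsort

def uniform_soft_rank(pred: torch.Tensor, regularization_strength: float = 1e-3) -> torch.Tensor:
    # put the predictions on a fixed scale so one regularization_strength works everywhere
    pred = (pred - pred.mean(dim=1, keepdim=True)) / (pred.std(dim=1, keepdim=True) + 1e-8)
    rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
    # maps toward (0, 1); exact uniformity only as the soft ranks approach the hard ranks 1..n
    return (rr - 0.5) / rr.shape[1]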


Hi,

After reading the post, I thought it could also be interesting to add feature dissimilarity to the loss calculation. As I'm not sure how to compute the dataframe's .corrwith(…) in PyTorch, I implemented a very inefficient approach that cannot run on the GPU (it just uses numpy rather than PyTorch tensor ops).
Any feedback on the idea or how to implement it properly?


for f in feature_cols:
  train_data[f] -= 0.5

for epoch in range(epochs):
    np.random.shuffle(era_list)
    batch_count = 0
    acc_loss_train = 0
    for era in era_list:
        batch_count += 1

        # get features and target from data and put in tensors
        features = torch.tensor(train_data[train_data.erano == era].filter(like='feature').values)
        target = torch.tensor(train_data[train_data.erano == era]['target'])

        # zero gradient buffer and get model output
        optimizer.zero_grad()
        model.train()
        model_output = model(features)

        orig_loss = -numerair_tb(model_output, target)

        #dissimilarity (numpy-based, so this term carries no gradient)
        train_era = train_data[train_data.erano == era].copy()
        example_preds = train_era[example_col].values
        example_preds = (example_preds - np.mean(example_preds)) / np.std(example_preds)

        train_era['example_preds'] = example_preds
        train_era['preds'] = model_output.detach().numpy()

        u = train_era[feature_cols].corrwith(train_era['preds'])
        e = train_era[feature_cols].corrwith(train_era['example_preds'])
        dissimilarity = np.sum((np.dot(u,e)/np.dot(e,e)))

        #final loss: orig_loss is already the negated correlation, so add it directly
        loss = orig_loss + torch.tensor(dissimilarity)

        acc_loss_train += loss 
        loss.backward()
        optimizer.step()

    loss_train = acc_loss_train / batch_count

I think I came up with an implementation that would work on the GPU, as it uses PyTorch. However, reading the True Contribution Details post, exposure dissimilarity seems to be relevant only when combined with FNCv3 in a multiplicative way, so it might not make sense to use it without FNCv3.
Any feedback is more than welcome!

for f in feature_cols:
  train_data[f] -= 0.5

for epoch in range(epochs):
    np.random.shuffle(era_list)
    batch_count = 0
    acc_loss_train = 0
    for era in era_list:
        batch_count += 1

        # get features and target from data and put in tensors
        features = torch.tensor(train_data[train_data.erano == era].filter(like='feature').values)
        target = torch.tensor(train_data[train_data.erano == era]['target'])

        # zero gradient buffer and get model output
        optimizer.zero_grad()
        model.train()
        model_output = model(features)

        orig_loss = -numerair_tb(model_output, target)

        #dissimilarity
        train_era = train_data[train_data.erano == era]

        example_preds = torch.as_tensor(train_era['example_preds'].values) #Needs to be created previously
        example_preds = example_preds - example_preds.mean()
        corr_example_preds = (features.T * example_preds).sum(dim=1) / ((features.T * features.T).sum(dim=1) * (example_preds * example_preds).sum()).sqrt()

        preds = model_output
        preds = preds - preds.mean()
        corr_preds = (features.T * preds).sum(dim=1) / ((features.T * features.T).sum(dim=1) * (preds * preds).sum()).sqrt()

        num = torch.dot(corr_preds, corr_example_preds)            # dot(u, e), as in the numpy version
        denom = torch.dot(corr_example_preds, corr_example_preds)  # dot(e, e)

        dissimilarity = num / denom

        #final loss: orig_loss is already the negated correlation, so add it directly
        loss = orig_loss + dissimilarity

        acc_loss_train += loss 
        loss.backward()
        optimizer.step()

    loss_train = acc_loss_train / batch_count