Optimizing for FNC and TB scores

With the advent of TC, many users may wonder how to optimize for metrics beyond correlation and mean-squared error. Here we show how to directly optimize for metrics like TB200 and FNC. This is intended as a proof of concept and a source of inspiration, not a set of instructions.

A previous forum post demonstrated how to optimize for Spearman correlation directly. That work extends fairly simply to a top/bottom correlation (e.g. TB200), where only the most extreme values of the prediction are used in the correlation function.
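For context, the trick from that post is to replace the hard rank transform with torchsort's differentiable soft_rank, so that the correlation itself can serve as a loss. A minimal sketch (as in the Numerai score, only the predictions are ranked):

import torch
import torchsort

def spearman(pred, target, regularization_strength=.0001):
    # soft_rank is a differentiable approximation of the rank transform,
    # so gradients flow through the correlation below
    pred = torchsort.soft_rank(pred.reshape(1, -1),
                               regularization_strength=regularization_strength)
    target = target.reshape(1, -1)
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()
    return (pred * target).sum()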

import torch
import pandas as pd
import numpy as np
import torchsort
from torch.distributions import Normal
import torch.nn.functional as F
import torch.optim as optim
from torch import nn

normal = Normal(0,1)

def numerair_tb(pred, target, tb=None, gaussianize=False, regularization_strength=.0001):
    # Computes a differentiable Numerai correlation score, optionally using only
    # the top and bottom tb values. Set gaussianize=True to apply a Gauss-rank
    # transform to the predictions instead of a plain rank transform.
    
    pred = pred.reshape(1, -1)
    target = target.reshape(1, -1)
    
    # compute differentiable soft ranks of the predictions
    rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
    
    # change pred to uniform distribution
    pred = (rr - .5)/rr.shape[1]
    
    # convert uniform to gaussian distribution
    if gaussianize:
        pred = normal.icdf(pred)
        
    # select top/bottom indices
    if tb is not None:
        tbidx = torch.bitwise_xor(rr<=tb, rr > (rr.shape[1]-tb))
        pred = pred[tbidx]
        target = target[tbidx]
    
    # Pearson correlation
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()
    return (pred * target).sum()
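As a quick smoke test (purely hypothetical random tensors), gradients flow through the whole computation, including the top/bottom selection:

pred = torch.rand(1000, requires_grad=True)
target = torch.rand(1000)
score = numerair_tb(pred, target, tb=200)  # TB200-style correlation
score.backward()  # gradients reach pred via soft_rank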

If we want to control the feature exposure of the top/bottom part of the signal, it can be helpful to have the correlation function return this exposure as well, so it can be incorporated into the overall cost function. Here is a modified version of the above that also returns the total feature exposure:

import torch
import pandas as pd
import numpy as np
import torchsort
from torch.distributions import Normal
import torch.nn.functional as F
import torch.optim as optim
from torch import nn

normal = Normal(0,1)

def numerai_r_tb_exposure(pred, target, features, tb=None, gaussianize=False, regularization_strength=.0001):
    # Computes and returns a Numerai score and feature exposure
    
    pred = pred.reshape(1, -1)
    target = target.reshape(1, -1)
    
    # compute differentiable soft ranks of the predictions
    rr = torchsort.soft_rank(pred, regularization_strength=regularization_strength)
    # change pred to uniform distribution
    pred = (rr - .5)/rr.shape[1]
    
    # convert uniform to gaussian distribution
    if gaussianize:
        pred = normal.icdf(pred)
        
    # select top/bottom indices
    if tb is not None:
        tbidx = torch.bitwise_xor(rr<=tb, rr > (rr.shape[1]-tb))
        pred = pred[tbidx]
        target = target[tbidx]
        features = features[tbidx[0]]
    
    # Pearson correlation
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()
    
    return (pred * target).sum(), ((pred @ features)**2).sum()
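The second return value is the total squared exposure of the normalized prediction to each feature. A typical use is to negate the correlation and add a scaled exposure penalty, as the training loop below does, e.g.:

corr, exposure = numerai_r_tb_exposure(output, target, features, tb=500)
loss = -corr + exposure / 1e3  # trade off corr against feature exposure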

We can use the above cost functions to compute CORR and TB scores as well as feature-penalty terms. Because PyTorch includes a differentiable version of the pseudoinverse, we can also feature-neutralize a model’s predictions and directly optimize for FNC. Below we show how to train a simple neural network on a cost function that optimizes for FNC, FNC TB500, and CORR while penalizing feature exposure in the raw prediction and in the top/bottom 500 of the neutralized prediction. (We’ve found TB500 a bit more stable for optimization, as TB200 tends to overfit easily.) We initialize a simple neural network like:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # three fully connected layers: 1050 input features down to a single output
        self.lin1 = nn.Linear(1050, 100)
        self.lin2 = nn.Linear(100, 30)
        self.lin3 = nn.Linear(30, 1)
        # batch norm on the single output standardizes the prediction distribution
        self.bn = nn.BatchNorm1d(1)
        self.do1 = nn.Dropout(0.5)
        self.do2 = nn.Dropout(0.5)

    def forward(self, x):
        x = self.lin1(x)
        x = self.do1(F.mish(x))
        x = self.lin2(x)
        x = self.do2(F.mish(x))
        output = self.bn(self.lin3(x))
        return output
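The training loop below neutralizes the model output by regressing it on the features and keeping only the residual. Isolated as a helper (my naming, not from the original post), that step is just:

def neutralize(output, features, rcond=1e-6):
    # project output onto the span of the features and keep the orthogonal
    # residual; pinverse is differentiable, so gradients flow through this step
    b = features.pinverse(rcond=rcond) @ output
    return output - features @ b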

We can then set up a training loop as follows to optimize for this multi-part cost function.

for epoch in range(epochs):
    np.random.shuffle(era_list)
    for ii, era in enumerate(era_list):
        # get features and target from data and put in tensors
        features = torch.tensor(training_data[training_data.era == era].filter(like='feature').values) - .5
        target = torch.tensor(training_data[training_data.era == era]['target'])

        # zero gradient buffer and get model output
        optimizer.zero_grad()
        model.train()
        output = model(features)

        # neutralize model output: regress it on the features and keep the
        # residual (differentiable thanks to pinverse)
        b = features.pinverse(rcond=1e-6) @ output
        linear_pred = features @ b
        neutralized_output = output - linear_pred

        neut_tb_loss, neut_tb_exp = numerai_r_tb_exposure(neutralized_output, target, features, tb=500)
        neut_loss = numerair_tb(neutralized_output, target)
        orig_loss, orig_exp = numerai_r_tb_exposure(output, target, features)

        # maximize neutralized TB500 corr, neutralized corr, and raw corr;
        # penalize feature exposure of the neutralized top/bottom 500 and of the raw prediction
        loss = -neut_tb_loss - neut_loss - orig_loss \
                + neut_tb_exp/1e3 + orig_exp/1e4

        loss.backward()
        optimizer.step()
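Inference is then a standard eval-mode forward pass. A sketch, assuming live_data carries the same feature columns used in training:

model.eval()
with torch.no_grad():
    features = torch.tensor(live_data.filter(like='feature').values,
                            dtype=torch.float32) - .5
    predictions = model(features).squeeze().numpy()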

We’ve trained a model using this code and have submitted it here. The validation statistics for this model are here. Again, this is far from optimized and is meant only to show what is possible, but it seems fairly decent already. Cheers and good luck!



Awesome! I suppose the same process can be applied when optimizing for FNCv3? Do I understand correctly that the only difference is the feature set we are neutralizing against?

Yup, that is correct. The old FNC used the original 310 features.


That’s clear, thank you!

Thanks for sharing! I was wondering whether there is any special meaning to training per era, i.e., whether the loss functions only make sense when used this way.

Do you think that a random batch, or more than one era per batch, would hurt the convergence of the model?

Feature neutralization makes the most sense on a per-era basis.
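That is, neutralize within each era rather than across the pooled data. In pandas terms, a sketch (assuming a prediction column and an era column):

import numpy as np
import pandas as pd

def neutralize_era(df):
    # regress predictions on features within a single era, keep the residual
    X = df.filter(like='feature').values
    p = df['prediction'].values
    return pd.Series(p - X @ (np.linalg.pinv(X) @ p), index=df.index)

df['neutralized'] = df.groupby('era', group_keys=False).apply(neutralize_era)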


How many features does FNCv3 include? And v2 and v1?

Thanks!

FNCv3 uses the 420 features of the “medium” feature set they released.
On RC they announced that code for FNCv3 will be released soon.
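In the meantime, given the list of those column names (the medium_features list below is hypothetical), the same pinverse trick from the main post applies:

# hypothetical: df holds the features, medium_features the 420 "medium" column names
medium = torch.tensor(df[medium_features].values, dtype=torch.float32) - .5
b = medium.pinverse(rcond=1e-6) @ output
fncv3_output = output - medium @ b  # neutralized w.r.t. the medium set only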


@mdo What is the purpose of the gaussianize switch? What effect would it have to make the uniform distribution of the prediction a gaussian distribution?

When using validation data for early stopping, does it make sense to use eras as batches, or shouldn’t it make a difference there?
Thanks!

You do check for early stopping at the end of an epoch, right?
If so, I’d say it is a good idea to use era batches for training.

“You do check for early stopping at the end of an epoch, right?”

Yes, I do. At the end of each epoch I use validation data to check for early stopping. My doubt is whether I should calculate the total loss on the validation data using per-era batches there, or whether it does not really matter (I would use per-era batches with the training data but not with the validation data used for early stopping).

You calculate the corr score per era, but I do not see a reason why you should predict in batches.


My doubt is whether, at the end of each training epoch, it makes sense to do an early-stopping check using validation data batched by validation eras, like this:


def validation_early_stopping(val_data, model):
    model.eval()
    era_list = val_data.erano.unique()
    np.random.shuffle(era_list)
    batch_count = 0
    acc_loss_val = 0.0  # accumulate per-era losses, starting from zero
    
    with torch.no_grad(): 
      for era in era_list:
          batch_count += 1
          # get features and target from data and put in tensors
          features = torch.tensor(val_data[val_data.erano == era].filter(items=feature_names).values) - .5
          target = torch.tensor(val_data[val_data.erano == era]['target'])
          features = features.cuda()
          target = target.cuda()

          output = model(features)
          
          # neutralize model output
          b = features.pinverse(rcond=1e-6) @ output
          linear_pred = features @ b
          neutralized_output = output - linear_pred

          neut_tb_loss, neut_tb_exp = numerai_r_tb_exposure(neutralized_output, target, features, tb=500)
          neut_loss = numerair_tb(neutralized_output, target)
          orig_loss, orig_exp = numerai_r_tb_exposure(output, target, features)
          
          loss = -neut_tb_loss - neut_loss - orig_loss + neut_tb_exp/1e3 + orig_exp/1e4
          
          acc_loss_val += loss

      loss_val = acc_loss_val / batch_count
      return loss_val.item()

As we are using TB500, I’m not sure if the size or the composition of the validation batches matters here, or if it’s even conceptually correct to check early stopping like this in this case.

Will there be an update soon for the Numerai tournament? I wasn’t sure if something had already changed, but I did not notice anything. I am not sure where to go for big notices like that.

Perhaps I misunderstand the meaning of TB500. I believe it refers to the top/bottom 500 prediction values. This is indeed a smaller subset than the full era; however, in the code it looks like the TB500 samples are used in their neutralized form in addition to the full set of sample losses, effectively adding extra pressure on the top and bottom 500 to improve performance.

Unless I’m reading this code incorrectly (very possible), the neut_loss (effectively the corr for the entire set of neutralized predictions) and the orig_loss (the corr for the entire set of raw predictions) are being maximized due to the “-” when they are included in the final loss calculation. This is also where the additional loss from the TB500 is included.

If I’m reading this wrong however, I would love a clear breakdown of the process.

Sounds like you’ve got it!

@mdo If Numerai is performing feature neutralization on our predictions before TC calculations, would it not help to know which features the team is using to neutralize with? Since Numerai is not aware of which features we may have used to generate predictions, how have they determined the best features to use in neutralization? Or do they use them all?

I see above a mention:

FNCv3 uses the 420 features of the “medium” feature set they released.
On RC they announced that code for FNCv3 will be released soon.

Is this information public somewhere? I don’t find it in the ‘Neutralization’ section of the Docs.

Predictions are not neutralized before TC calculations, just Gauss-rank transformed.
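For reference, that Gauss-rank transform is the same operation the gaussianize option above performs: rank the values, map the ranks into (0, 1), and push them through the inverse normal CDF. A minimal (non-differentiable) numpy/scipy sketch:

import numpy as np
from scipy.stats import norm

def gauss_rank(x):
    # rank -> approximately uniform in (0, 1) -> inverse normal CDF
    ranks = x.argsort().argsort() + 1
    return norm.ppf((ranks - 0.5) / len(x))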


Isn’t neutralization part of the optimizer though?