Custom loss functions for XGBoost using PyTorch

Here is some code showing how you can use PyTorch to create custom objective functions for XGBoost. Objective functions for XGBoost must return a gradient and the diagonal of the Hessian (i.e. the matrix of second derivatives). Internally, XGBoost uses the Hessian diagonal to rescale the gradient. The Hessian is very expensive to compute, so we replace it with all ones. This basically forces XGBoost to do standard gradient descent rather than the fancier second-order version it usually uses. It works fine, but it makes training a bit more sensitive to the step size, so watch things carefully. Below we make a function to use Adjusted Sharpe as a cost function for XGBoost. Because the Adjusted Sharpe calculation is not defined for a constant initial condition, we first fit a model using the standard least-squares cost function, then start from there (i.e. the base margin) and fit additional trees to improve the in-sample Adjusted Sharpe. This should be enough to get you going, have fun!
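For reference, the contract XGBoost's sklearn API expects from a custom objective is just a callable taking (ytrue, ypred) and returning per-row gradient and Hessian-diagonal arrays. Here is a toy squared-error sketch of that interface (not part of the model below, just to show the shape of things) before the full Adjusted Sharpe version:

import numpy as np

def toy_squared_error_obj(ytrue, ypred):
    # gradient of 0.5 * (ypred - ytrue)**2 with respect to ypred
    grad = ypred - ytrue
    # ones in place of the true Hessian diagonal, as discussed above
    hess = np.ones_like(ypred)
    return grad, hess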

import numpy as np 
import pandas as pd 
from xgboost import XGBRegressor 
import torch
from torch.autograd import grad


trainval = pd.read_parquet("numerai_training_validation_target_nomi.parquet")
train = trainval[trainval.data_type=='train']

target = "target_nomi" 
feature_columns = [c for c in trainval if c.startswith("feature")] 

# fit an initial model
model_init = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=2000, colsample_bytree=0.1, nthread=6)
model_init.fit(train[feature_columns], train[target])

# get prediction from initial model as starting point to improve upon
base_margin = model_init.predict(train[feature_columns])

# get indexes for each era
era_idx = [np.where(train.era==uera)[0] for uera in train.era.unique()]


# define adjusted sharpe in terms of cost adjusted numerai sharpe
def numerai_sharpe(x):
    return (x.mean() - 0.010415154) / x.std()

def skew(x):
    mx = x.mean()
    m2 = ((x-mx)**2).mean()
    m3 = ((x-mx)**3).mean()
    return m3/(m2**1.5)    

def kurtosis(x):
    mx = x.mean()
    m4 = ((x-mx)**4).mean()
    m2 = ((x-mx)**2).mean()
    return (m4/(m2**2))-3

def adj_sharpe(x):
    return numerai_sharpe(x) * (1 + ((skew(x) / 6) * numerai_sharpe(x)) - ((kurtosis(x) / 24) * (numerai_sharpe(x) ** 2)))

# use correlation as the measure of fit
def corr(pred, target):
    pred_n = pred - pred.mean(dim=0)
    pred_n = pred_n / pred_n.norm(dim=0)

    target_n = target - target.mean(dim=0)
    target_n = target_n / target_n.norm(dim=0)
    l = torch.matmul(pred_n.T, target_n)
    return l

# define a custom objective for XGBoost
def adj_sharpe_obj(ytrue, ypred):
    # convert to pytorch tensors
    ypred_th = torch.tensor(ypred, requires_grad=True)
    ytrue_th = torch.tensor(ytrue)
    all_corrs = []

    # get correlations in each era
    for ee in era_idx:
        score = corr(ypred_th[ee], ytrue_th[ee])
        all_corrs.append(score)

    all_corrs = torch.stack(all_corrs)

    # calculate adjusted sharpe using correlations
    loss = -adj_sharpe(all_corrs)
    print(f'Current loss:{loss}')

    # calculate gradient and convert to numpy
    loss_grads = grad(loss, ypred_th, create_graph=True)[0]
    loss_grads = loss_grads.detach().numpy()

    # return gradient and ones instead of Hessian diagonal
    return loss_grads, np.ones(loss_grads.shape)


model_adj_sharpe = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=200, nthread=6, colsample_bytree=0.1, objective=adj_sharpe_obj)
model_adj_sharpe.fit(train[feature_columns], train[target], base_margin=base_margin)
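
As a quick sanity check, you can score the new model era-wise on the validation rows. This is a rough sketch; it assumes the same parquet also contains data_type == 'validation' rows and that your xgboost version's sklearn predict accepts base_margin (we pass the initial model's margin at predict time, since the adjusted-Sharpe trees were fit on top of it):

# era-wise evaluation on the validation slice (sketch)
validation = trainval[trainval.data_type == 'validation']

val_base = model_init.predict(validation[feature_columns])
val_pred = model_adj_sharpe.predict(validation[feature_columns], base_margin=val_base)

# per-era Pearson correlation between predictions and the target
val_corrs = []
for e in validation.era.unique():
    m = (validation.era == e).values
    val_corrs.append(np.corrcoef(val_pred[m], validation[target].values[m])[0, 1])
val_corrs = np.array(val_corrs)

print("mean era corr:", val_corrs.mean())
print("sharpe of era corrs:", val_corrs.mean() / val_corrs.std())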

Thank you @mdo for submitting this example! Minor comment: you forgot to import numpy as np.


Has anyone tried to perform cross-validation on this model?

The snippet below can replace the last line of mdo's code…

from sklearn import model_selection
from sklearn.metrics import make_scorer, mean_squared_error

param_fit_grid = {'base_margin': base_margin}

score = model_selection.cross_val_score(
    model_adj_sharpe,
    train[feature_columns],
    train[target],
    cv=3,
    n_jobs=-1,
    scoring=make_scorer(mean_squared_error),
    fit_params=param_fit_grid,
    error_score=123)

print(score)

However, it returns my error_score=123. After some investigation, I guess the problem occurs here:

    # get correlations in each era
    for ee in era_idx:
        score = corr(ypred_th[ee], ytrue_th[ee])

More exactly ypred_th[ee]: apparently, after one successful CV fold it can't find the respective index in the ypred_th tensor.

Moreover, if you replace the objective function with the squared_log example function shown in the xgb docs, it works fine with cross_val_score.
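
To illustrate what I think is happening (just a sketch of my guess, using the variables from the original post): era_idx stores positions into the full train frame, but cross_val_score only passes a fold-sized subset to fit(), so inside the objective ypred is shorter than the largest stored index:

# suspected mismatch: era_idx was built from positions in the FULL train set,
# but inside cross_val_score the objective only sees a fold-sized ypred
n_full = len(train)
n_fold_train = 2 * n_full // 3             # roughly what a 3-fold split passes to fit()

max_stored_idx = max(idx.max() for idx in era_idx)
print(max_stored_idx >= n_fold_train)      # True -> ypred_th[ee] indexes out of range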

One more thing

The order of the parameters in adj_sharpe_obj(ytrue, ypred) is flipped relative to the xgb docs standard; not sure if it can create any noise.

I would just write your own cross-validation code to make sure you know what it’s doing with a custom loss like this, and make sure you always do cross-validation era-wise, which it doesn’t look like you were trying to do.
And you had me worried for a sec, but if you check the custom loss documentation for the sklearn API you’ll see that my code has it in the correct order:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn


It’s totally confusing that it is in the opposite order in the two APIs, but what are you gonna do, ask for your money back? :man_shrugging:

Intel vs AT&T, all over again.

PS: Systems joke on a data science forum. :slight_smile:

I am aware of era-wise CV; I left cv=3 for simplicity.

When you say “write your own cross validation code”, are you suggesting extending the BaseSearchCV class, similar to what you did here?

class TimeSeriesSplitGroups(_BaseKFold)

Sorry about the confusion with xgb docs…

From your response “I am aware of era-wise CV; I left cv=3 for simplicity” I’m not sure you are getting what I mean. The number of folds is independent of what I mean by era-wise CV. By era-wise CV I mean that all the break points between folds are at era boundaries, so no fold contains partial eras. You have to use the groups argument in the sklearn splitters to get that behavior. But I was suggesting not using the sklearn stuff and just doing the indexing yourself to make sure everything is working as expected.
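
For example, a rough sketch of doing the era-wise indexing by hand (assuming the variables from the original post, i.e. train, feature_columns, target, adj_sharpe_obj and era_idx, are in scope, and with n_estimators reduced for speed) might look like this. Note that adj_sharpe_obj reads the module-level era_idx, so it has to be rebuilt per fold from that fold’s own training rows:

# hand-rolled era-wise CV (sketch): fold boundaries fall on whole eras,
# and era_idx / base_margin are rebuilt per fold so the custom objective
# sees positions that match the rows actually passed to fit()
unique_eras = train.era.unique()
era_folds = np.array_split(unique_eras, 3)

scores = []
for k, val_eras in enumerate(era_folds):
    val_mask = train.era.isin(val_eras).values
    tr, va = train[~val_mask], train[val_mask]

    # rebuild the module-level era_idx so adj_sharpe_obj indexes this fold's training rows
    era_idx = [np.where(tr.era == e)[0] for e in tr.era.unique()]

    # initial least-squares model for the base margin, as in the original post
    m0 = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=200,
                      colsample_bytree=0.1, nthread=6)
    m0.fit(tr[feature_columns], tr[target])

    m1 = XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=200,
                      colsample_bytree=0.1, nthread=6, objective=adj_sharpe_obj)
    m1.fit(tr[feature_columns], tr[target], base_margin=m0.predict(tr[feature_columns]))

    # score: mean per-era correlation on the held-out eras
    preds = m1.predict(va[feature_columns], base_margin=m0.predict(va[feature_columns]))
    fold_corrs = [np.corrcoef(preds[(va.era == e).values],
                              va[target].values[(va.era == e).values])[0, 1]
                  for e in val_eras]
    scores.append(np.mean(fold_corrs))
    print(f"fold {k}: mean era corr = {scores[-1]:.4f}")

print("CV mean:", np.mean(scores))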

Hi Michael, appreciate your attention. I believe I know what you mean…

From the example scripts:

# Group K-fold
CV = model_selection.GroupKFold(n_splits=3)
grp = list(CV.split(X=X_train, y=y_train, groups=era_train))

Unless I’m missing something… replacing cv=3 with cv=grp in my original snippet doesn’t change the output…