Gradient Boosting Machines for multi-target regression

perfect_fit · October 28, 2021, 11:45am

Hi guys!

Just wanted to share some insights on training Gradient Boosting Machines (GBMs) for multi-target regression to prepare for the new dataset. It would also be cool to get a discussion going on this and hear how you approach it.

XGBoost does not seem to support multi-target regression out of the box. You can work around this with sklearn’s MultiOutputRegressor. However, it fits one independent regressor per target, so relationships between targets will not be learned.

As far as I understand, LightGBM and sklearn’s GradientBoostingRegressor also do not support multi-target regression out of the box.

Example of using MultiOutputRegressor for XGBoost:

from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

estimator = XGBRegressor(objective='reg:squarederror')
model = MultiOutputRegressor(estimator=estimator).fit(X_train, y_train)
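To make the "one regressor per target" point concrete, here is a minimal self-contained sketch. It swaps in sklearn's GradientBoostingRegressor and synthetic data from make_regression purely so that it runs without XGBoost installed; both are assumptions for illustration, and XGBRegressor should slot into the wrapper the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with 3 targets (illustrative only, not the real dataset).
X, y = make_regression(n_samples=200, n_features=10, n_targets=3, random_state=0)

base = GradientBoostingRegressor(n_estimators=20, random_state=0)
model = MultiOutputRegressor(base).fit(X, y)

# The wrapper keeps one fitted clone of the base estimator per target column.
n_fitted = len(model.estimators_)
pred_shape = model.predict(X).shape
print(n_fitted, pred_shape)

# Fitting target 0 on its own reproduces the wrapper's first estimator,
# i.e. the other targets had no influence on it.
solo = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y[:, 0])
same = np.allclose(solo.predict(X), model.estimators_[0].predict(X))
print(same)
```

The last check is the important one: each target's model is identical to what you would get by training on that target alone.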

In contrast, CatBoost provides support for multi-target regression. Just make sure you set loss_function and eval_metric to 'MultiRMSE'.

Example for CatBoost:

from catboost import Pool, CatBoostRegressor
dtrain = Pool(X_train, label=y_train)
dvalid = Pool(X_val, label=y_val)

params = {'learning_rate': 0.1, 'depth': 6,
          'loss_function': 'MultiRMSE', 'eval_metric': 'MultiRMSE'}

model = CatBoostRegressor(**params)
model.fit(dtrain, eval_set=dvalid, use_best_model=True)

Another thing to look out for is that you might want to evaluate performance on each target separately. For this I loop over all targets, calculate spearmanr and aggregate:

Evaluation example:

import numpy as np
from scipy.stats import spearmanr

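# Clipping assumes the targets are scaled to [0, 1].
y_pred_valid = model.predict(X_val).clip(0, 1)
y_pred_train = model.predict(X_train).clip(0, 1)
train_spearmans = []
val_spearmans = []
# y_train / y_val must be 2-D NumPy arrays whose columns follow the order of `targets`.
targets = [col for col in df.columns if col.startswith("target")]
for i, target in enumerate(targets):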
    tr_spearman = spearmanr(y_train[:, i], y_pred_train[:, i]).correlation
    val_spearman = spearmanr(y_val[:, i], y_pred_valid[:, i]).correlation
    train_spearmans.append(tr_spearman)
    val_spearmans.append(val_spearman)
    print(f"Spearman correlation for {target}:")
    print(f"Train: {tr_spearman.round(4)}")
    print(f"Valid: {val_spearman.round(4)}")
mean_train_spearman = np.mean(train_spearmans)
mean_val_spearman = np.mean(val_spearmans)
print("Average Spearman over all targets:")
print(f"Train: {mean_train_spearman.round(4)}")
print(f"Valid: {mean_val_spearman.round(4)}")

Hope this helps! Very curious to hear how you are tackling the multi-output regression problem using GBMs.

eleven_sigma · October 28, 2021, 7:41pm

Interesting. Did you find any documentation on how CatBoost applies boosting to multiple targets?
I didn’t find anything about the approach used.

perfect_fit · October 28, 2021, 8:02pm

Good question. The CatBoost documentation can be really vague.

I found this short note on how MultiRMSE is calculated:
https://catboost.ai/en/docs/concepts/loss-functions-multiregression#MultiRMSE
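
As I read that page, MultiRMSE sums the squared errors over all target dimensions for each object, takes the weighted average over objects, and then the square root. A small NumPy sketch of that reading follows; the function name multi_rmse and the toy arrays are my own, and this is not CatBoost's code.

```python
import numpy as np

def multi_rmse(y_true, y_pred, sample_weight=None):
    """Hypothetical helper: MultiRMSE as I understand it from the docs page."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(y_true.shape[0])
    w = np.asarray(sample_weight, dtype=float)
    # Squared errors summed over the target dimensions, one value per object.
    per_object = ((y_pred - y_true) ** 2).sum(axis=1)
    # Weighted mean over objects, then square root.
    return np.sqrt((w * per_object).sum() / w.sum())

y_true = np.array([[0.0, 0.0], [1.0, 1.0]])
y_pred = np.array([[0.0, 1.0], [1.0, 3.0]])
score = multi_rmse(y_true, y_pred)
print(score)
```

Note that under this reading the value is not the mean of the per-target RMSEs, so it is not directly comparable to scores averaged per target.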

@hedgingcat has an awesome implementation example. This was one of the few code examples I found for using MultiRMSE with CatBoost:
https://www.kaggle.com/gogo827jz/multiregression-catboost-1-model-for-206-targets

hedgingcat · October 28, 2021, 8:17pm

This notebook is old. I have found that the latest version of CatBoost even supports MultiLogloss with a custom metric. However, GPU is still not supported.

eleven_sigma · October 28, 2021, 9:04pm

The documentation refers to an ‘error metric’ when it talks about MultiRMSE, not an ‘objective’, which is usually the name for the internal function used to compute the gradient / Hessian for boosting.
Do you think this is a true multi-response model and not a wrapper around something like MultiOutputRegressor in sklearn?
I’m looking at the code and can’t find the gradient computation for MultiRMSE.

perfect_fit · October 30, 2021, 11:22am

Hmm, the documentation seems to imply that ‘loss_function’ is an alias of ‘objective’, so MultiRMSE should be an objective. However, I also can’t find any information about the internal function to compute gradient / Hessian. The documentation uses loss function and metric synonymously sometimes, which makes it even more confusing.

Docs including ‘loss_function’ definition:
https://catboost.ai/en/docs/references/training-parameters/common
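
For what it's worth, the math alone does not settle the question. For a squared-error loss summed over targets, L = 0.5 * sum_d (a_d - t_d)^2, the gradient with respect to each prediction a_d is just a_d - t_d and the Hessian is the identity. The per-target gradients are therefore independent, and any coupling between targets would have to come from the trees sharing one structure with vector-valued leaves. The sketch below only checks that derivation numerically with NumPy; it says nothing about what CatBoost does internally, and the function names are made up for illustration.

```python
import numpy as np

def summed_squared_error(pred, target):
    # 0.5 * sum over target dimensions of the squared error for one object.
    return 0.5 * np.sum((pred - target) ** 2)

def analytic_gradient(pred, target):
    # dL/da_d = a_d - t_d for every target dimension d.
    return pred - target

rng = np.random.default_rng(0)
target = rng.normal(size=3)
pred = rng.normal(size=3)

# Central finite differences, one target dimension at a time.
eps = 1e-6
numeric = np.zeros_like(pred)
for d in range(pred.size):
    step = np.zeros_like(pred)
    step[d] = eps
    numeric[d] = (summed_squared_error(pred + step, target)
                  - summed_squared_error(pred - step, target)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - analytic_gradient(pred, target)))
print(max_abs_diff)
```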