Hi guys!
Just wanted to share some insights on training Gradient Boosting Machines (GBMs) for multi-target regression to prepare for the new dataset. It would also be cool to get a discussion going on this and hear your insights.
XGBoost does not seem to support multi-target regression out of the box (newer releases, from 1.6 on, add experimental multi-output support, so check your version). This can be worked around by using sklearn’s MultiOutputRegressor. However, it fits one independent regressor per target, so relationships between targets will not be learned.
As far as I understand, LightGBM and sklearn’s GradientBoostingRegressor also do not support multi-target regression out of the box.
Example of using MultiOutputRegressor for XGBoost:
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor
estimator = XGBRegressor(objective='reg:squarederror')
model = MultiOutputRegressor(estimator=estimator).fit(X_train, y_train)
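To make the "one regressor per target" point concrete, here is a minimal sketch using sklearn's GradientBoostingRegressor inside the same wrapper. The synthetic data and the names X_demo, y_demo, base, wrapped and single are my own assumptions for illustration, not part of any real pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data (assumption, illustration only): 200 rows, 5 features, 3 targets.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 3))
y_demo = X_demo @ W + 0.1 * rng.normal(size=(200, 3))

base = GradientBoostingRegressor(n_estimators=20, random_state=0)
wrapped = MultiOutputRegressor(base).fit(X_demo, y_demo)

# The wrapper keeps one fitted regressor per target column.
n_fitted = len(wrapped.estimators_)

# Fitting the same estimator on target 0 alone should reproduce
# column 0 of the wrapper's predictions: nothing is shared across targets.
single = GradientBoostingRegressor(n_estimators=20, random_state=0)
single.fit(X_demo, y_demo[:, 0])
same_as_single = np.allclose(wrapped.predict(X_demo)[:, 0], single.predict(X_demo))
```

If same_as_single comes out True, the wrapper is just a convenience around independent single-target fits, which is why it cannot pick up relationships between targets.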
In contrast, CatBoost provides support for multi-target regression. Just make sure you set loss_function and eval_metric to 'MultiRMSE'.
Example for CatBoost:
from catboost import Pool, CatBoostRegressor
dtrain = Pool(X_train, label=y_train)
dvalid = Pool(X_val, label=y_val)
params = {'learning_rate': 0.1, 'depth': 6,
          'loss_function': 'MultiRMSE', 'eval_metric': 'MultiRMSE'}
model = CatBoostRegressor(**params)
model.fit(dtrain, eval_set=dvalid, use_best_model=True)
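For intuition about what MultiRMSE measures, here is a small NumPy sketch of the unweighted metric as I read it from the CatBoost documentation: squared errors are summed over targets, averaged over rows, then square-rooted. The function name multi_rmse and the toy arrays are hypothetical, and this is not CatBoost's own implementation.

```python
import numpy as np

def multi_rmse(y_true, y_pred):
    """Unweighted MultiRMSE sketch: sqrt(mean over rows of summed squared errors)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Sum squared errors across the target dimension for each row.
    per_row = ((y_pred - y_true) ** 2).sum(axis=1)
    # Average over rows, then take the square root.
    return float(np.sqrt(per_row.mean()))

# Toy example (assumption): two rows, two targets.
y_true_toy = np.array([[0.0, 0.0], [0.0, 0.0]])
y_pred_toy = np.array([[3.0, 4.0], [0.0, 0.0]])
toy_score = multi_rmse(y_true_toy, y_pred_toy)  # sqrt((25 + 0) / 2)
```

Because the errors of all targets go into a single number, targets with a larger scale dominate the loss, so it may be worth checking that your targets are on comparable scales.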
Another thing to look out for is that you might want to evaluate performance on each target separately. For this I loop over all targets, calculate the Spearman correlation for each, and average the results.
Evaluation example:
import numpy as np
from scipy.stats import spearmanr
y_pred_valid = model.predict(X_val).clip(0, 1)
y_pred_train = model.predict(X_train).clip(0, 1)
train_spearmans = []
val_spearmans = []
targets = [col for col in df.columns if col.startswith("target")]
for i, target in enumerate(targets):
    tr_spearman = spearmanr(y_train[:, i], y_pred_train[:, i]).correlation
    val_spearman = spearmanr(y_val[:, i], y_pred_valid[:, i]).correlation
    train_spearmans.append(tr_spearman)
    val_spearmans.append(val_spearman)
    print(f"Spearman correlation for {target}:")
    print(f"Train: {tr_spearman.round(4)}")
    print(f"Valid: {val_spearman.round(4)}")
mean_train_spearman = np.mean(train_spearmans)
mean_val_spearman = np.mean(val_spearmans)
print("Average Spearman over all targets:")
print(f"Train: {mean_train_spearman.round(4)}")
print(f"Valid: {mean_val_spearman.round(4)}")
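If you want to reuse the per-target evaluation, the loop can be wrapped in a small helper. This is only a sketch: the function name per_target_spearman and the synthetic arrays are hypothetical, and it assumes y_true and y_pred are 2D NumPy arrays with one column per target.

```python
import numpy as np
from scipy.stats import spearmanr

def per_target_spearman(y_true, y_pred):
    """Return an array with one Spearman correlation per target column."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for i in range(y_true.shape[1]):
        # spearmanr returns (correlation, p-value); keep only the correlation.
        rho, _ = spearmanr(y_true[:, i], y_pred[:, i])
        scores.append(rho)
    return np.array(scores)

# Synthetic check (assumption, illustration only):
# column 0 is a monotone transform of the truth, column 1 is reversed.
rng = np.random.default_rng(0)
y_true_demo = rng.random((50, 2))
y_pred_demo = np.column_stack([y_true_demo[:, 0] ** 2, -y_true_demo[:, 1]])
demo_scores = per_target_spearman(y_true_demo, y_pred_demo)
demo_mean = demo_scores.mean()
```

Returning the full array rather than only the mean makes it easy to spot a single target that drags the average down.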
Hope this helps! Very curious to hear how you are tackling the multi-output regression problem using GBMs.