Super Massive LGBM Grid Search

Introduction

We at Numerai have spent the last couple of weeks conducting a new grid search on the Sunshine V4.1 dataset. We have built hundreds of models with different hyperparameters on the V4.1 dataset, focused on target_cyrus_v4_20, which we believe is currently the best single target for our hedge fund strategy.

We are sharing the best of these grid-searched results, in terms of correlation and correlation Sharpe, so that users can benefit from the grid search: either by using these results directly in their models or by running more targeted searches of their own around the sweet spots.

Experimental Setup

We did the grid search using the following parameters:

Features = all the features in the V4.1 dataset

Target = target_cyrus_v4_20

Algorithm = Scikit-learn API LGBMRegressor

Hyperparameter ranges for the better results are shared below:

n_estimators = 30k - 60k

learning_rate = 0.001

max_depth = 5, 6, 7

num_leaves = 2**max_depth - 1

colsample_bytree = 0.1
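As a concrete illustration, here is a minimal sketch of training one grid point from these ranges with the scikit-learn API. The parquet path and feature/target column selection are assumptions based on the standard V4.1 downloads, not something specified in this post.

```python
import lightgbm as lgb
import pandas as pd

# Assumed file name and column layout for the V4.1 training data; adjust to your download.
train = pd.read_parquet("v4.1/train.parquet")
feature_cols = [c for c in train.columns if c.startswith("feature_")]
target_col = "target_cyrus_v4_20"

# One grid point from the ranges above (max_depth = 6 shown here).
max_depth = 6
model = lgb.LGBMRegressor(
    n_estimators=30_000,          # grid-searched range: 30k - 60k
    learning_rate=0.001,
    max_depth=max_depth,
    num_leaves=2**max_depth - 1,
    colsample_bytree=0.1,
)
model.fit(train[feature_cols], train[target_col])
```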

Results

The correlation and correlation Sharpe results below were computed using the out-of-sample predictions from era 578 to era 1059 inclusive. The training period was from era 1 to era 574 inclusive.
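For readers scoring their own runs, here is a rough sketch of how per-era correlation and correlation Sharpe can be computed. It uses a plain Spearman rank correlation per era as a stand-in for Numerai's exact scoring function, and the column names are assumptions.

```python
import pandas as pd

def era_corrs(df: pd.DataFrame) -> pd.Series:
    """Spearman correlation between prediction and target within each era."""
    return df.groupby("era").apply(
        lambda d: d["prediction"].corr(d["target"], method="spearman")
    )

def corr_sharpe(corrs: pd.Series) -> float:
    """Mean per-era correlation divided by its standard deviation."""
    return corrs.mean() / corrs.std()

# validation_df is assumed to have columns: era, prediction, target
# (out-of-sample eras 578-1059 in this post)
# corrs = era_corrs(validation_df)
# print("corr:", corrs.mean(), "corr sharpe:", corr_sharpe(corrs))
```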

Table 1 below contains the 20 best correlation results.

Table 1

Table 2 below contains the 20 best correlation Sharpe results.

Table 2

The plot below shows the cumulative correlation of:

  1. The Sunshine recommended-parameter model with learning_rate = 0.001, n_estimators = 20k and max_depth = 6.

  2. The best correlation model from the above table.

  3. The 2 best correlation Sharpe models from the above table.

Alternate hyperparameters for less compute

The above parameters require 6 hours to compute for a tree of max_depth = 6, a learning rate of 0.001 and n_estimators = 100k on a 24-core processor. This may be a heavy compute burden for users.

We show below parameters that work with lower compute, using a learning rate of 0.01, n_estimators = 20k and colsample_bytree = 0.1. This reduces the compute time to less than 2 hours.
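For reference, a sketch of this lower-compute configuration. The max_depth is not stated for this variant, so the value of 6 below is an assumption; data loading is as in the earlier sketch.

```python
import lightgbm as lgb

# Lower-compute variant: higher learning rate, fewer trees.
fast_model = lgb.LGBMRegressor(
    n_estimators=20_000,
    learning_rate=0.01,
    max_depth=6,            # not stated for this variant; assumed here
    num_leaves=2**6 - 1,
    colsample_bytree=0.1,
)
# fast_model.fit(train[feature_cols], train[target_col])  # as in the earlier sketch
```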

Conclusion

These lower-compute results are very competitive in correlation Sharpe space and only slightly worse in terms of pure correlation, while offering a significant compute saving.

21 Likes

Is it possible to rerun the best hyperparameters for 5 different random seeds and report the mean and standard deviation of Corr and Sharpe for each hyperparameter set?
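A rough sketch of what such a seed study could look like, reusing the assumed data layout and the era_corrs helper from the sketches above:

```python
import numpy as np
import lightgbm as lgb

# Hypothetical seed study: refit one hyperparameter set with several seeds and
# summarise the spread of the validation metrics.
corr_means, corr_sharpes = [], []
for seed in range(5):
    m = lgb.LGBMRegressor(
        n_estimators=30_000,
        learning_rate=0.001,
        max_depth=6,
        num_leaves=63,
        colsample_bytree=0.1,
        random_state=seed,
    )
    m.fit(train[feature_cols], train[target_col])
    validation_df["prediction"] = m.predict(validation_df[feature_cols])
    corrs = era_corrs(validation_df)
    corr_means.append(corrs.mean())
    corr_sharpes.append(corrs.mean() / corrs.std())

print("Corr over seeds:   mean", np.mean(corr_means), "std", np.std(corr_means))
print("Sharpe over seeds: mean", np.mean(corr_sharpes), "std", np.std(corr_sharpes))
```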

3 Likes

Came here to say that. Results are usually really sensitive to the seed. Also, I am pretty sure L1/L2 regularization, the boosting mode, and allowing regression at the leaf level can change the results.

1 Like

Has anyone tried using a log loss cost function instead of a regression? That would just mean minimizing the log loss where the actuals are 0, 0.25, 0.5, 0.75, and 1 instead of just 0/1. I'm wondering if the probability conversion helps with fitting due to the curvature in the cost function.

It also has other interesting properties: a 0.5 guess on any observation costs the same regardless of the true label, which kind of “standardizes” 0.5 as an acceptable placeholder guess.

Also, the improvement from going from the worst prediction to okay, compared to going from okay to perfect, is relatively larger using the log loss cost function than squared error. This seems to incentivize a model to find more “average” parameters that benefit the overall group of assets more than an individual asset. To illustrate, look at the actual == 0 row:

  • 10x improvement to go from 0.999 to 0.5 for log loss, 4x improvement for sq err
  • 693x improvement to go from 0.5 to 0.001 for log loss, 250,000x improvement for sq err
Log loss (rows = actual, columns = predicted):

                0.001   0.25    0.5     0.75    0.999
        0       0.001   0.288   0.693   1.386   6.908
        0.25    1.728   0.562   0.693   1.112   5.181
        0.5     3.454   0.837   0.693   0.837   3.454
        0.75    5.181   1.112   0.693   0.562   1.728
        1       6.908   1.386   0.693   0.288   0.001

Squared error (rows = actual, columns = predicted):

                0.001   0.25    0.5     0.75    0.999
        0       0.000   0.063   0.250   0.563   0.998
        0.25    0.062   0.000   0.063   0.250   0.561
        0.5     0.249   0.063   0.000   0.063   0.249
        0.75    0.561   0.250   0.063   0.000   0.062
        1       0.998   0.563   0.250   0.063   0.000
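For anyone who wants to reproduce the tables, a small sketch evaluating both cost functions on the same grid of predictions, with cross-entropy generalised to fractional (soft) labels as assumed above:

```python
import numpy as np

preds = np.array([0.001, 0.25, 0.5, 0.75, 0.999])
actuals = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def log_loss(y, p):
    """Cross-entropy generalised to soft (fractional) labels."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def sq_err(y, p):
    return (y - p) ** 2

for y in actuals:
    print(f"actual={y:<5}",
          "log loss:", np.round(log_loss(y, preds), 3),
          "| sq err:", np.round(sq_err(y, preds), 3))
```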
4 Likes

Cool experiment!

The case for 20,000+ tree models seems dubious. Maybe a gain of +0.0010 Corr? Glad you guys showed the results for sub-20k tree models.

More important question: how many trees for good TC?

EDIT: OK fine, I will admit the 20k+ tree models are generally more stable at higher depths.

2 Likes

Thanks guys. Can you publish the tables for the lower-compute version?

1 Like

… also: are you considering a grid search for XGBoost? Especially for low-compute settings?

Maybe this preprint can help? [2303.07925] Robust incremental learning pipelines for temporal tabular datasets with distribution shifts

1 Like

How did the approaches in your preprint work in May, @thomasxthomas? Did you have models which reliably avoided the drawdown so many models experienced? Which approach worked best?

Is there a reason:

  • a full grid search was done rather than a more sophisticated method of hyperparameter search?
  • important LightGBM parameters like min_data_in_leaf or regularization were not included?

I don’t understand these experiments, since better results were published here:

What is the reason for performing a huge grid search and getting worse results?