Era Boosted Models

murenoha · July 16, 2021, 12:34am

Thank you for sharing the era boost idea and implementation.

The first argument of spearmanr as defined in this script is target,
but the function is called with pred as the first argument.

Pull Request

github.com/numerai/example-scripts

Fix to rank predictions instead of ranking targets

master ← murenoha:fix-order-of-arguments-in-spearmanr

opened 09:30AM - 10 Jul 21 UTC

murenoha

+1 -1

Thank you for sharing era boost idea and implementation.

# Summary

Current… code ranks targets. I fixed incorrect order of arguments in `spearmanr`.

# Details

The first argument of `spearmanr` defined in this script is `target`.

```python
def spearmanr(target, pred):
    return np.corrcoef(
        target,
        pred.rank(pct=True, method="first")
    )[0, 1]
```

but this function is used with `pred` specified as the first argument. Therefore, this code ranks targets.

```python
era_scores[era] = spearmanr(era_df["pred"], era_df["target"])
```

I fixed to rank predictions.

I fixed the incorrect order of arguments in spearmanr.
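
To make the effect of the argument order concrete, here is a minimal sketch. The era_df values are made up for illustration (they are not Numerai data), and the spearmanr definition is the one quoted in the pull request. Because only the second argument is ranked, swapping the arguments ranks the target instead of the predictions and generally gives a different score.

```python
import numpy as np
import pandas as pd


def spearmanr(target, pred):
    # Only the second argument is ranked; the first is used as-is.
    return np.corrcoef(target, pred.rank(pct=True, method="first"))[0, 1]


# Hypothetical single-era frame, invented for this example.
era_df = pd.DataFrame(
    {
        "target": [0.0, 0.25, 0.5, 0.5, 0.75, 1.0],
        "pred": [0.10, 0.40, 0.35, 0.80, 0.60, 0.90],
    }
)

# Intended call: predictions are ranked, target is left untouched.
ranked_pred_score = spearmanr(era_df["target"], era_df["pred"])

# Swapped call from the original script: the target gets ranked instead.
ranked_target_score = spearmanr(era_df["pred"], era_df["target"])

print(ranked_pred_score, ranked_target_score)
```

Note that with method="first", tied target values (common for bucketed targets) receive different ranks based on row order, which is one more reason the swapped call is not equivalent.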

nickkon · October 10, 2021, 1:13pm

At least for LGBM, I do not think that this is true. I believe (but am no longer sure) that I also checked this behaviour with xgboost.
I tried using model.n_estimators += trees_per_step at first and noticed that each iteration took longer and longer to train, which led me to investigate. I was using something like 100 trees per step and 10 iterations. I then ran some tests with and without model.n_estimators += trees_per_step.

Example with trees_per_step=5 and num_iters=10:

Without model.n_estimators += trees_per_step:
model.n_estimators prints 5, which is logical since the model was initialized with 5 trees.
model.booster_.num_trees() prints 50, which is trees_per_step*num_iters. So 50 trees have been built, 5 per iteration, as expected. The elapsed time for each iteration is about equal, since each one trains the same number of trees (5).

With model.n_estimators += trees_per_step:
model.n_estimators prints 50. The parameter n_estimators is a cumulative sum of trees_per_step for each iteration, thus the final value is trees_per_step*num_iters. Seems correct.

But model.booster_.num_trees() prints 275, which means that 275 trees have been built. This seems weird at first. But since n_estimators increases with each iteration, more and more trees are built per iteration: the first iteration builds 5 trees, the next one 10, then 15, 20, 25, and so on up to the last iteration, which builds trees_per_step*num_iters trees. The final number of trees is the sum over all iterations: sum((i+1)*trees_per_step for i in range(num_iters)) = 275.
Since every iteration builds trees_per_step more trees than the previous one, the training time keeps increasing from one iteration to the next.
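
The tree counts above can be reproduced with plain arithmetic, without training anything. This is only a sketch of the bookkeeping, assuming (as observed in the tests described here) that each fit call adds the current n_estimators trees to the existing booster; total_trees is a hypothetical helper, not part of LightGBM or the example script.

```python
def total_trees(trees_per_step, num_iters, cumulative):
    """Count the trees built over num_iters boosting iterations.

    cumulative=False: n_estimators stays at trees_per_step.
    cumulative=True:  n_estimators grows by trees_per_step before
                      every iteration after the first.
    """
    built = 0
    for i in range(num_iters):
        # Trees requested in this iteration.
        n_estimators = (i + 1) * trees_per_step if cumulative else trees_per_step
        built += n_estimators
    return built


fixed_total = total_trees(trees_per_step=5, num_iters=10, cumulative=False)
growing_total = total_trees(trees_per_step=5, num_iters=10, cumulative=True)

print(fixed_total)    # 50
print(growing_total)  # 275
```

In closed form the cumulative case is trees_per_step * num_iters * (num_iters + 1) / 2, so the total number of trees, and with it the training time, grows quadratically in the number of iterations.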

Whether or not you use model.n_estimators += trees_per_step does change the results. I added “commulative_trees: bool=False” as a parameter to decide whether n_estimators should keep increasing. The intuition behind using it might be something like:
Fit 5 trees on the easy eras, calculate the worst performing eras, and use more and more trees to fit them since they are ‘harder’ to predict.
