One thing I’m not clear on is how building “layers (of different trees) at each depth simultaneously” is compatible with boosting, since each tree is trained on gradients calculated using the full ensemble up to that point.
Node values are decided at the time of their layer’s creation with the gradient at that round, so I don’t think one can accurately call the gradient out of date. Maybe you could say that the splits above the leaves are, but I can’t imagine that mattering with the learning rates we’re dealing with.
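To make the scheduling concrete, here’s a toy numpy sketch of roughly how the pipelining works (not the actual implementation, and ignoring the GPU side entirely): each round refreshes the gradient against the current ensemble, every in-flight tree is extended by one layer using that fresh gradient, and a tree’s leaf values are fitted with the gradient of the round in which its leaf layer is built.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=2000)

MAX_DEPTH, LR, ROUNDS = 3, 0.1, 300

def grow_layer(tree, X, grad, n_candidates=8):
    """Extend every frontier node by one split, scored with the *current* gradient."""
    new_frontier = []
    for idx in tree["frontier"]:
        best, best_gain = None, 0.0
        for _ in range(n_candidates):                # a few random candidate cuts
            f = int(rng.integers(X.shape[1]))
            t = rng.choice(X[idx, f])
            left, right = idx[X[idx, f] <= t], idx[X[idx, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = grad[left].sum() ** 2 / len(left) + grad[right].sum() ** 2 / len(right)
            if gain > best_gain:
                best, best_gain = (left, right), gain
        new_frontier += list(best) if best is not None else [idx]
    tree["frontier"] = new_frontier

pred = np.zeros_like(y)
in_flight = []                      # trees still being built, one layer per round
for _ in range(ROUNDS):
    grad = y - pred                 # fresh (negative) gradient vs. the current ensemble
    in_flight.append({"frontier": [np.arange(len(y))], "depth": 0})
    for tree in in_flight:
        grow_layer(tree, X, grad)   # this layer is scored with this round's gradient
        tree["depth"] += 1
    done = [t for t in in_flight if t["depth"] == MAX_DEPTH]
    in_flight = [t for t in in_flight if t["depth"] < MAX_DEPTH]  # unfinished trees wait for later rounds
    for tree in done:               # leaf values use the current round's gradient
        for leaf in tree["frontier"]:
            pred[leaf] += LR * grad[leaf].mean()

print("train RMSE:", float(np.sqrt(np.mean((y - pred) ** 2))))
```

The only thing that ever relies on an older gradient is the choice of splits above the leaves, which is the point above.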
Also, IME tree-wise sampling (eg colsample_bytree) works better than layer-wise sampling (eg colsample_bylevel): each tree’s learnable representation is more constrained because it sees fewer features, so the individual trees’ predictions are more diverse than they are under layer-wise sampling.
Whether trees see more total features under layer-wise sampling is a matter of the respective sample rates. To the extent that I tested it in xgboost, layer-wise sampling outperformed tree-wise sampling when each was given its best-performing rate.
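For concreteness, the comparison was along these lines (a sketch with synthetic data and placeholder rates rather than my actual benchmark; the point is to sweep colsample_bytree and colsample_bylevel separately and compare each scheme at its best rate):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=5000, n_features=50, n_informative=20,
                       noise=10.0, random_state=0)

# tree-wise: one feature subset per tree; layer-wise: a fresh subset per depth level
candidates = {
    "tree-wise":  dict(colsample_bytree=0.5, colsample_bylevel=1.0),
    "layer-wise": dict(colsample_bytree=1.0, colsample_bylevel=0.5),
}
for name, cols in candidates.items():
    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, **cols)
    rmse = -cross_val_score(model, X, y, cv=3,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name:<10} RMSE = {rmse:.2f}")
```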
If we consider the limit of the gradient boosting process as the learning rate approaches zero and the number of trees approaches infinity, the feature sampling scheme becomes a weighting scheme: the weight of each possible leaf at some point in training is the probability that a random tree will include it, which is the product of the probabilities of each ancestor node’s split, given that the splits leading to it were taken.
Under a layer sampling scheme, the probability that a node splits on a feature is entirely independent of the splits taken to reach that node: it is simply the probability that the feature provides the best-scoring cut in a random feature sub-sample, which is a function only of the rank of that feature’s cut score against all others. In the limit case, this function gives the relative weights of all trees that include each cut. This is what the function relating cut-score rank to relative weight looks like for a couple of different sampling rates:
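For reference, the function itself is easy to write down. Assuming each layer draws k = round(rate · n) of the n features without replacement, the cut ranked r-th overall is the best one available in a layer’s sample with probability C(n − r, k − 1) / C(n, k): the feature has to be drawn, and none of the r − 1 better-scoring ones may be. A small sketch of that curve:

```python
from math import comb

def best_in_sample_prob(rank, n_features, rate):
    """Probability that the cut ranked `rank` (1 = best) wins a random layer sample."""
    k = max(1, round(rate * n_features))
    return comb(n_features - rank, k - 1) / comb(n_features, k)

n = 50
for rate in (0.3, 0.7):
    curve = [best_in_sample_prob(r, n, rate) for r in range(1, n + 1)]
    print(f"rate={rate}: rank 1 -> {curve[0]:.3f}, rank 5 -> {curve[4]:.3f}, "
          f"rank 20 -> {curve[19]:.4f}")
```

The rank-1 weight is just the sample rate itself (k/n), and higher rates concentrate more of the weight on the top-ranked cuts.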
Tree-wise sampling complicates this in two ways. First, a cut on a feature cannot exist at a node if that feature would have scored better at one of the ancestor nodes than the split actually taken there (the tree would have used it higher up instead). Second, features used in previous cuts are necessarily available to the node and thus have a higher probability (and therefore weight) than they would otherwise. I think it would be very difficult to argue that the first effect is beneficial, as it appears to punish features for being good, and it weights some good cuts to zero, which doubtlessly hurts variety. If good cuts are weighted too heavily, it makes more sense to gently adjust the weight of all good cuts by reducing the sample rate than to set some good cuts to zero with tree-level sampling. The second effect is harder to judge, but it can be reproduced in a layer sampling scheme by adding just those few features used in parent cuts to a layer’s feature set, without the side effect of the first; or at least it could if I weren’t using an extra-trees scheme to begin with.
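To make that variant concrete, it would just be a per-layer sampler along these lines (a hypothetical helper, not anything from xgboost or my code):

```python
import numpy as np

def layer_feature_set(n_features, rate, ancestor_features, rng):
    """Random per-layer feature sample, plus the features already used by the node's ancestors."""
    k = max(1, round(rate * n_features))
    sampled = rng.choice(n_features, size=k, replace=False)
    return np.union1d(sampled, list(ancestor_features))

rng = np.random.default_rng(0)
print(layer_feature_set(20, 0.3, {4, 11}, rng))   # the ancestors' features are always included
```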
Taking this limit might seem overly theoretical, but learning rates below a certain point seem to show drastically diminishing returns in terms of improvement, which makes me think the limit is relevant.
Have you tried implementing this in Triton? It seems like it would be quite a similar implementation, but it has some nice interfaces to PyTorch.
From a brief look at Triton, I don’t think it’s suitable, as it appears to abstract away shared memory and warp boundaries. That’s probably fine for basic tensor manipulation, but building trees with the highest efficiency requires careful control of shared memory and of how data is distributed across warps.