I've been trying to understand how LightGBM handles L1 losses (MAE, MAPE, Huber).
According to this article, the gain from a split should depend only on the first and second derivatives of the loss function. This is because LightGBM uses a second-order approximation of the loss, so the loss can be approximated as follows.
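Writing it out the way I understand the article, in the usual second-order notation ($g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction, $f_t$ is the new tree, and $\lambda$ is the regularization term), the approximation and the resulting split gain are roughly:

$$
\mathcal{L}^{(t)} \approx \sum_{i}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t),
$$

$$
\text{Gain} = \tfrac{1}{2}\left[\frac{\left(\sum_{i\in L} g_i\right)^2}{\sum_{i\in L} h_i + \lambda} + \frac{\left(\sum_{i\in R} g_i\right)^2}{\sum_{i\in R} h_i + \lambda} - \frac{\left(\sum_{i\in L\cup R} g_i\right)^2}{\sum_{i\in L\cup R} h_i + \lambda}\right],
$$

so the gain only ever sees sums of gradients and Hessians.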
For L1 losses, however, the absolute value of the gradient of the loss is constant and its Hessian is 0. I've also read that, to deal with this, for loss functions with a zero Hessian we should use 1 as the Hessian instead:
"For these objective function with first_order_gradient is constant, LightGBM has a special treatment for them: (...) it will use the constant gradient for the tree structure learning, but use the residual for the leaf output calculation, with percentile function, e.g. 50% for MAE. This solution is from sklearn, and is proven to work in many benchmarks."
However, even using a constant Hessian doesn't make sense to me: for MAE, for instance, the gradient is just the sign of the error, so the squared gradient carries no information. Does this mean that when the gradient has constant magnitude, LightGBM does not use the second-order approximation and falls back to traditional gradient boosting?
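To make my mental picture concrete, this is how I imagine the "use 1 as the Hessian" treatment if I were to write it myself as a custom objective in the Python API (just my own sketch to illustrate the question, not LightGBM's internal code; the custom-objective signature is the standard `(preds, train_data) -> (grad, hess)` one):

```python
import numpy as np
import lightgbm as lgb

def mae_objective(preds, train_data):
    """Sketch of an MAE objective: the gradient is the sign of the error,
    the true Hessian is 0 everywhere, and a constant 1.0 is returned instead,
    which is what the quoted comment suggests happens for tree-structure learning."""
    y = train_data.get_label()
    grad = np.sign(preds - y)      # d|preds - y| / dpreds, i.e. +-1
    hess = np.ones_like(preds)     # true second derivative is 0; use 1 instead
    return grad, hess

# Usage sketch (X, y assumed to be an existing feature matrix / target):
# dtrain = lgb.Dataset(X, label=y)
# booster = lgb.train({"verbosity": -1}, dtrain, num_boost_round=100,
#                     fobj=mae_objective)  # or params["objective"] = mae_objective,
#                                          # depending on the LightGBM version
```

With this sketch, every per-instance contribution to the gain formula above is $(\pm 1)^2 / 1$, which is exactly why I don't see how the split gain can be informative.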
On the other hand, there is the GOSS boosting strategy described in the original LightGBM paper.
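As far as I can reconstruct it from my reading of the paper (with $A$ the retained large-gradient instances, $B$ the sampled small-gradient instances, $a$ and $b$ the top and random sampling ratios, and $n_l^j(d)$, $n_r^j(d)$ the instance counts on each side of a split on feature $j$ at point $d$), the estimated variance gain is something like:

$$
\tilde{V}_j(d) = \frac{1}{n}\left[\frac{\left(\sum_{x_i\in A_l} g_i + \frac{1-a}{b}\sum_{x_i\in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i\in A_r} g_i + \frac{1-a}{b}\sum_{x_i\in B_r} g_i\right)^2}{n_r^j(d)}\right].
$$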
So for GOSS the authors consider the square of the sum of the gradients, and I see the same problem as above: if the gradient of the MAE is just the sign of the error, how does squaring (sums of) gradients reflect a gain? Does this mean GOSS also won't work for loss functions with a constant-magnitude gradient?
Thanks in advance,