I've been trying to understand how LightGBM handles L1 losses (MAE, MAPE, Huber).
According to this article, the gain from a split should depend only on the first and second derivatives of the loss function. This is because LightGBM uses a second-order approximation of the loss, so the loss can be approximated as follows.
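Writing it out the way I understand the article, in the usual second-order notation ($g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction, $f_t$ is the new tree, and $\lambda$ is the regularization term), the approximation and the resulting split gain are roughly:

$$
\mathcal{L}^{(t)} \approx \sum_{i}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t),
$$

$$
\text{Gain} = \tfrac{1}{2}\left[\frac{\left(\sum_{i\in L} g_i\right)^2}{\sum_{i\in L} h_i + \lambda} + \frac{\left(\sum_{i\in R} g_i\right)^2}{\sum_{i\in R} h_i + \lambda} - \frac{\left(\sum_{i\in L\cup R} g_i\right)^2}{\sum_{i\in L\cup R} h_i + \lambda}\right],
$$

so the gain only ever sees sums of gradients and Hessians.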
For L1 losses, however, the absolute value of the gradient of the loss is constant and its Hessian is 0. I've also read that, to deal with this, for loss functions with a zero Hessian we should use 1 as the Hessian instead:
"For these objective function with first_order_gradient is constant, LightGBM has a special treatment for them: (...) it will use the constant gradient for the tree structure learning, but use the residual for the leaf output calculation, with percentile function, e.g. 50% for MAE. This solution is from sklearn, and is proven to work in many benchmarks."
However, even using a constant Hessian doesn't make sense to me: for MAE, for instance, the gradient is just the sign of the error, so the squared gradient carries no information. Does this mean that when the gradient has constant magnitude, LightGBM does not use the second-order approximation and falls back to traditional gradient boosting?
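To make my mental picture concrete, this is how I imagine the "use 1 as the Hessian" treatment if I were to write it myself as a custom objective in the Python API (just my own sketch to illustrate the question, not LightGBM's internal code; the custom-objective signature is the standard `(preds, train_data) -> (grad, hess)` one):

```python
import numpy as np
import lightgbm as lgb

def mae_objective(preds, train_data):
    """Sketch of an MAE objective: the gradient is the sign of the error,
    the true Hessian is 0 everywhere, and a constant 1.0 is returned instead,
    which is what the quoted comment suggests happens for tree-structure learning."""
    y = train_data.get_label()
    grad = np.sign(preds - y)      # d|preds - y| / dpreds, i.e. +-1
    hess = np.ones_like(preds)     # true second derivative is 0; use 1 instead
    return grad, hess

# Usage sketch (X, y assumed to be an existing feature matrix / target):
# dtrain = lgb.Dataset(X, label=y)
# booster = lgb.train({"verbosity": -1}, dtrain, num_boost_round=100,
#                     fobj=mae_objective)  # or params["objective"] = mae_objective,
#                                          # depending on the LightGBM version
```

With this sketch, every per-instance contribution to the gain formula above is $(\pm 1)^2 / 1$, which is exactly why I don't see how the split gain can be informative.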
On the other hand, there is the GOSS boosting strategy described in the original LightGBM paper.
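As far as I can reconstruct it from my reading of the paper (with $A$ the retained large-gradient instances, $B$ the sampled small-gradient instances, $a$ and $b$ the top and random sampling ratios, and $n_l^j(d)$, $n_r^j(d)$ the instance counts on each side of a split on feature $j$ at point $d$), the estimated variance gain is something like:

$$
\tilde{V}_j(d) = \frac{1}{n}\left[\frac{\left(\sum_{x_i\in A_l} g_i + \frac{1-a}{b}\sum_{x_i\in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i\in A_r} g_i + \frac{1-a}{b}\sum_{x_i\in B_r} g_i\right)^2}{n_r^j(d)}\right].
$$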
So for GOSS the authors consider the square of the sum of the gradients, and I see the same problem as above: if the gradient of the MAE is just the sign of the error, how does squaring (sums of) gradients reflect a gain? Does this mean GOSS also won't work for loss functions with a constant-magnitude gradient?
Thanks in advance,