I am looking into the numerical stability of binary classification with LightGBM and XGBoost. A good place to start seems to be the computation of the gradient and Hessian. Both require evaluating a logistic (sigmoid) function, which as I understand it can become numerically unstable for inputs of large magnitude: exp(-x) can overflow for large negative x, and the result can saturate to exactly 0 or 1.
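To make my concern concrete, here is a small standalone test I wrote (my own harness, not code from either library) showing where a plain float sigmoid saturates:

#include <cmath>
#include <cstdio>

static float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

int main() {
  for (float x : {-100.0f, -20.0f, 0.0f, 20.0f, 100.0f}) {
    const float p = Sigmoid(x);
    // At x = -100, exp(-x) = exp(100) overflows float to +inf, so p == 0.0f.
    // At x = +20, exp(-x) is already so small that 1 + exp(-x) rounds to 1,
    // so p == 1.0f exactly.
    std::printf("x = %7.1f  sigmoid = %.9g  p*(1-p) = %.9g\n", x, p, p * (1.0f - p));
  }
  return 0;
}

So for |x| around 20 and beyond, the float sigmoid already returns exactly 0 or 1, and p * (1 - p) becomes exactly 0, which is why I suspect the gradient and Hessian computations downstream.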
The following is XGBoost's implementation of binary logistic loss. An epsilon value is used in the computation of the Hessian, but only for the Hessian. Why is it not needed for the gradient or the sigmoid function? And why is it needed for the Hessian?
struct LogisticRegression {
  template <typename T>
  static T PredTransform(T x) { return common::Sigmoid(x); }
  static bool CheckLabel(bst_float x) { return x >= 0.0f && x <= 1.0f; }
  template <typename T>
  static T FirstOrderGradient(T predt, T label) { return predt - label; }
  template <typename T>
  static T SecondOrderGradient(T predt, T label) {
    const T eps = T(1e-16f);
    return std::max(predt * (T(1.0f) - predt), eps);
  }
  static bst_float ProbToMargin(bst_float base_score) {
    CHECK(base_score > 0.0f && base_score < 1.0f)
        << "base_score must be in (0,1) for logistic loss";
    return -std::log(1.0f / base_score - 1.0f);
  }
  static const char* LabelErrorMsg() {
    return "label must be in [0,1] for logistic regression";
  }
  static const char* DefaultEvalMetric() { return "rmse"; }
};

// logistic loss for binary classification task.
struct LogisticClassification : public LogisticRegression {
  static const char* DefaultEvalMetric() { return "error"; }
};

inline float Sigmoid(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}
Link to the Sigmoid function: https://github.com/dmlc/xgboost/blob/24f527a1c095b24115dc5d54ad35cc25d3bc3032/src/common/math.h
Link to the objective functions: https://github.com/dmlc/xgboost/blob/master/src/objective/regression_obj.cc#L37
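Here is my current reading of why only the Hessian needs the clamp, written as a standalone sketch. The leaf-weight formula w = -G / (H + lambda) is from the XGBoost paper; everything else (variable names, the concrete values) is my own:

#include <algorithm>
#include <cstdio>

int main() {
  const float eps = 1e-16f;
  const float predt = 1.0f;   // saturated prediction, e.g. Sigmoid(20.0f)
  const float label = 0.0f;   // a confidently misclassified example
  const float lambda = 0.0f;  // no regularization, worst case

  const float grad = predt - label;                    // = 1, bounded in [-1, 1]
  const float hess_raw = predt * (1.0f - predt);       // = 0 exactly
  const float hess_clamped = std::max(hess_raw, eps);  // >= 1e-16

  // The Hessian ends up in a denominator; without eps this divides by zero.
  std::printf("leaf weight without eps: %g\n", -grad / (hess_raw + lambda));     // -inf
  std::printf("leaf weight with eps:    %g\n", -grad / (hess_clamped + lambda)); // -1e16
  return 0;
}

If this is right, the gradient never needs clamping because predt - label is always bounded, and the sigmoid itself only saturates rather than producing inf or NaN; only the Hessian ends up as a denominator that must be kept away from zero. But I am not certain this is the whole story.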
The following is LightGBM's implementation of GetGradients for binary logistic loss. As far as I can see, no epsilon value similar to the one in XGBoost's implementation is used. Can this lead to numerical instability?
void GetGradients(const double* score, score_t* gradients, score_t* hessians) const override {
  if (weights_ == nullptr) {
    #pragma omp parallel for schedule(static)
    for (data_size_t i = 0; i < num_data_; ++i) {
      // get label and label weights
      const int is_pos = is_pos_(label_[i]);
      const int label = label_val_[is_pos];
      const double label_weight = label_weights_[is_pos];
      // calculate gradients and hessians
      const double response = -label * sigmoid_ / (1.0f + std::exp(label * sigmoid_ * score[i]));
      const double abs_response = fabs(response);
      gradients[i] = static_cast<score_t>(response * label_weight);
      hessians[i] = static_cast<score_t>(abs_response * (sigmoid_ - abs_response) * label_weight);
    }
  }
  // ... (the weighted branch is omitted here; see the link below)
}
Link to the binary logistic loss class: https://github.com/Microsoft/LightGBM/blob/1c92e75d0342989359c469b1ffabc2901038c0f2/src/objective/binary_objective.hpp
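To check my reading of the formula above, I wrote the following standalone sketch. The values are mine, and I am assuming label is +1/-1, sigmoid_ = 1.0 (the default scaling parameter as far as I can tell), and score_t = float, which is what the linked source suggests:

#include <cmath>
#include <cstdio>

int main() {
  const double sigmoid_ = 1.0;  // assumed default scaling parameter
  const int label = 1;          // a correctly classified positive example
  for (double score : {1.0, 40.0, 120.0}) {
    const double response = -label * sigmoid_ / (1.0 + std::exp(label * sigmoid_ * score));
    const double abs_response = std::fabs(response);
    const double hessian = abs_response * (sigmoid_ - abs_response);
    // At score = 120, exp(120) is about 1.3e52, so response is about -7.7e-53
    // and the cast to float (score_t) flushes the Hessian to exactly 0.
    std::printf("score = %6.1f  hessian (double) = %.3g  as float = %.3g\n",
                score, hessian, static_cast<float>(hessian));
  }
  return 0;
}

If this is correct, the Hessians stored as score_t become exactly zero once the margin is large enough, and unlike XGBoost there is no eps clamp. Is that a real problem in practice, or is it handled elsewhere, e.g. when the leaf values are computed?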
I hope someone can help me reason about these questions, as they are giving me quite a hard time. If numerical instability can occur, what practical example could trigger it?
Thank you very much in advance.