
I am looking into the numerical stability of classification with LightGBM and XGBoost. I believe a good place to start is the computation of the gradient and Hessian. Both require evaluating a logistic (sigmoid) function, which, as I understand it, can become numerically unstable for very small (i.e., large negative) inputs, since exp(-x) can overflow.
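
To make this concrete, here is a minimal sketch (my own test harness, not code from either library) that probes a naive sigmoid at extreme margins. Under IEEE 754 arithmetic, exp(-x) overflows to +inf for large negative x, so the quotient saturates to exactly 0; for large positive x it underflows toward 0 and the result saturates to 1:

#include <cmath>
#include <cstdio>

// Naive sigmoid, the same form as XGBoost's common::Sigmoid quoted below.
inline float NaiveSigmoid(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}

int main() {
  // exp(100) overflows float to +inf, so 1/(1 + inf) saturates to 0.
  std::printf("%g\n", NaiveSigmoid(-100.0f));  // prints 0
  // exp(-100) underflows toward 0, so the result saturates to 1.
  std::printf("%g\n", NaiveSigmoid(100.0f));   // prints 1
  return 0;
}

So the sigmoid itself does not produce invalid values; the real issue is saturation: once predt is exactly 0 or 1, predt * (1 - predt) is exactly 0.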

The following is XGBoost's implementation of binary logistic loss. Here an epsilon value is used in the computation of the Hessian, but only for the Hessian. Why is this not needed for the gradient or the sigmoid function? And why is it needed for the Hessian?

struct LogisticRegression {
  template <typename T>
  static T PredTransform(T x) { return common::Sigmoid(x); }
  static bool CheckLabel(bst_float x) { return x >= 0.0f && x <= 1.0f; }
  // gradient of the logistic loss w.r.t. the margin: p - y
  template <typename T>
  static T FirstOrderGradient(T predt, T label) { return predt - label; }
  // Hessian: p * (1 - p), clamped away from zero with eps
  template <typename T>
  static T SecondOrderGradient(T predt, T label) {
    const T eps = T(1e-16f);
    return std::max(predt * (T(1.0f) - predt), eps);
  }
  static bst_float ProbToMargin(bst_float base_score) {
    CHECK(base_score > 0.0f && base_score < 1.0f)
        << "base_score must be in (0,1) for logistic loss";
    return -std::log(1.0f / base_score - 1.0f);
  }
  static const char* LabelErrorMsg() {
    return "label must be in [0,1] for logistic regression";
  }
  static const char* DefaultEvalMetric() { return "rmse"; }
};

// logistic loss for binary classification task.
struct LogisticClassification : public LogisticRegression {
  static const char* DefaultEvalMetric() { return "error"; }
};
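
For context on why a zero Hessian is a problem at all: in Newton-style boosting, the leaf weight is the step -G / (H + lambda), where G and H are the sums of the gradients and Hessians of the instances falling in the leaf (see the XGBoost docs link in the comments below). A minimal sketch, with variable names of my own, of what the eps clamp prevents when lambda is 0:

#include <algorithm>
#include <cstdio>

int main() {
  const double G = -3.0;      // example gradient sum in a saturated leaf
  const double lambda = 0.0;  // no L2 regularization configured
  // If every instance in the leaf has predt == 0 or predt == 1, the
  // unclamped Hessian sum is exactly 0.
  const double H_raw = 0.0;
  // Clamping at eps (as SecondOrderGradient does per instance) keeps the
  // denominator strictly positive.
  const double H_eps = std::max(H_raw, 1e-16);

  std::printf("%g\n", -G / (H_raw + lambda));  // inf: division by zero
  std::printf("%g\n", -G / (H_eps + lambda));  // huge but finite
  return 0;
}

This is presumably why the gradient predt - label needs no such clamp: it only ever appears in the numerator, where a value of exactly 0 is harmless.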


inline float Sigmoid(float x) {
  return 1.0f / (1.0f + std::exp(-x));
}

Link to the sigmoid function: https://github.com/dmlc/xgboost/blob/24f527a1c095b24115dc5d54ad35cc25d3bc3032/src/common/math.h
Link to the objective functions: https://github.com/dmlc/xgboost/blob/master/src/objective/regression_obj.cc#L37

The following is LightGBM's implementation of GetGradients for binary logistic loss. No epsilon value similar to the one in XGBoost's implementation is used, as far as I can see. Can this lead to numerical instability?

void GetGradients(const double* score, score_t* gradients, score_t* hessians) const override {
  if (weights_ == nullptr) {
    #pragma omp parallel for schedule(static)
    for (data_size_t i = 0; i < num_data_; ++i) {
      // get label and label weights
      const int is_pos = is_pos_(label_[i]);
      const int label = label_val_[is_pos];
      const double label_weight = label_weights_[is_pos];
      // calculate gradients and hessians
      const double response = -label * sigmoid_ / (1.0f + std::exp(label * sigmoid_ * score[i]));
      const double abs_response = fabs(response);
      gradients[i] = static_cast<score_t>(response * label_weight);
      hessians[i] = static_cast<score_t>(abs_response * (sigmoid_ - abs_response) * label_weight);
    }
  }
  // (branch handling per-instance weights omitted from this excerpt)
}

Link to the binary logistic loss class: https://github.com/Microsoft/LightGBM/blob/1c92e75d0342989359c469b1ffabc2901038c0f2/src/objective/binary_objective.hpp
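
To convince myself that LightGBM's ±1-label parameterization matches XGBoost's, here is a small self-check of my own (assuming sigmoid_ at its default of 1): response should equal predt - label in the {0,1} label convention, and abs_response * (sigmoid_ - abs_response) should equal predt * (1 - predt):

#include <cmath>
#include <cstdio>

int main() {
  const double sigmoid = 1.0;  // LightGBM's sigmoid_ parameter at its default
  for (double score : {-5.0, -0.5, 0.0, 2.0}) {
    for (int label : {-1, 1}) {  // LightGBM's {-1, +1} label encoding
      const double response = -label * sigmoid / (1.0 + std::exp(label * sigmoid * score));
      const double abs_response = std::fabs(response);
      const double hess = abs_response * (sigmoid - abs_response);
      // XGBoost-style quantities at the same point
      const double predt = 1.0 / (1.0 + std::exp(-score));
      const double y01 = (label > 0) ? 1.0 : 0.0;  // map {-1, +1} -> {0, 1}
      // both differences should be zero up to rounding error
      std::printf("grad diff %.3e, hess diff %.3e\n",
                  response - (predt - y01), hess - predt * (1.0 - predt));
    }
  }
  return 0;
}

So the per-instance Hessian can saturate to 0 in LightGBM exactly as in XGBoost, which is why I would expect a safeguard somewhere.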

I hope someone can help me reason about these questions, as it's giving me quite a hard time. If numerical instability can happen, what practical example could trigger it?

Thank you very much in advance.

Simon

1 Answer


LightGBM uses a different method to handle numerical stability.

  1. LightGBM limits the min/max output value of leaves: https://github.com/Microsoft/LightGBM/blob/master/include/LightGBM/tree.h#L14

  2. And when calculating the leaf output, it adds an epsilon; refer to https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp#L76 and https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp#L328

    The sum_hessians will always be greater than epsilon (see the sketch below).
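
A minimal sketch of this mechanism (the variable names and constants here are mine, not LightGBM's actual code; see the linked feature_histogram.hpp lines for the real implementation):

#include <algorithm>
#include <cstdio>

const double kEpsilon = 1e-15;      // hypothetical floor on the Hessian sum
const double kMaxLeafOutput = 100;  // hypothetical bound on leaf output

// Illustrative leaf-output computation: Newton step with a floored Hessian
// sum, then clipped to [-kMaxLeafOutput, kMaxLeafOutput].
double LeafOutput(double sum_gradients, double sum_hessians, double lambda_l2) {
  const double denom = std::max(sum_hessians, kEpsilon) + lambda_l2;
  const double raw = -sum_gradients / denom;
  return std::min(std::max(raw, -kMaxLeafOutput), kMaxLeafOutput);
}

int main() {
  // Saturated leaf: all per-instance Hessians underflowed to 0. There is
  // no division by zero, and the output is capped instead of exploding.
  std::printf("%g\n", LeafOutput(-3.0, 0.0, 0.0));  // prints 100, not inf
  return 0;
}

With the floor and the output cap, a leaf full of saturated instances yields a large but bounded output instead of inf or NaN.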

  • Thank you! I must admit I'm having a hard time finding my way around the repository. Do you know why it was chosen to limit the min/max number of leaves? I guess it is used against overfitting, but I find it weird that it is not a hyperparameter. Maybe it is for computational efficiency. Can you explain why the Hessian must not be 0? I don't see a standard computation of Newton's method where the gradient is divided by the Hessian. – Simon Dec 21 '17 at 09:59
  • @SejlerMonster It doesn't limit the min/max number of leaves; it limits the output of the leaves. You can use `num_leaves` to limit the number of leaves. If you use a second-order Taylor expansion, the optimal solution is the gradient/Hessian (details can be found in https://xgboost.readthedocs.io/en/latest/model.html#additive-training). – MaxInsulator Dec 22 '17 at 10:45