I'm trying to understand how leaf values are calculated in LightGBM's LGBMClassifier. I built a simple model with n_estimators=1 and max_depth=1, so it has a single decision tree with a single split. I then compared the scores output by the model with scores I calculated myself in Python. There is a slight difference between them: -0.000457 in leaf 0 and -0.001548 in leaf 1. Is this because LightGBM is written in C++ while I did the calculation in Python? Or is there something wrong with my calculation steps?

References:

What is leaf_values from Python LightGBM?

[Question] How leaf output determined for LambdaMART?

import pandas as pd
import lightgbm as lgb
from sklearn import datasets

## load data
data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['label'] = data.target

## build a simple model
params = {'max_depth':1, 'n_estimators': 1, 'learning_rate':0.1}
model = lgb.LGBMClassifier(**params)
model.fit(df.drop('label', axis=1), df['label'])

## calculate scores by the model
df_pred = df[['label']].copy()

df_pred['score_raw'] = model.predict_proba(df.drop('label', axis=1), raw_score=True)
df_pred['score_proba'] = model.predict_proba(df.drop('label', axis=1))[:, 1]
df_pred['leaf'] = model.predict_proba(df.drop('label', axis=1), pred_leaf=True)

## calculate scores by myself
start_value = len(df[df['label']==1])/len(df)
df_pred['gradient'] = start_value - df_pred['label']
df_pred['hessian'] = 1

def calc_score_proba_(leaf):
    # boolean mask of the rows that fall into this leaf
    in_leaf = df_pred['leaf'] == leaf
    sum_gradients = df_pred.loc[in_leaf, 'gradient'].sum()
    sum_hessians = df_pred.loc[in_leaf, 'hessian'].sum()
    delta = -sum_gradients / sum_hessians * model.learning_rate
    df_pred.loc[in_leaf, 'score_proba_'] = start_value + delta

calc_score_proba_(0)
calc_score_proba_(1)

## calculate the gap between the scores
df_pred['score_proba_gap'] = df_pred['score_proba'] - df_pred['score_proba_']

df_pred
    label   score_raw   score_proba  leaf    gradient    hessian score_proba_ score_proba_gap
0   0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
1   0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
2   0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
3   0   0.641339    0.655056    0   0.627417    1   0.655513    -0.000457
4   0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
... ... ... ... ... ... ... ... ...
564 0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
565 0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
566 0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
567 0   0.275629    0.568474    1   0.627417    1   0.570022    -0.001548
568 1   0.641339    0.655056    0   -0.372583   1   0.655513    -0.000457
569 rows × 8 columns

lgb.create_tree_digraph(model)

[Image: tree digraph produced by lgb.create_tree_digraph(model)]
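
For comparison, here is a minimal sketch of the same hand calculation done in log-odds (raw score) space rather than probability space. It assumes LightGBM's default binary log-loss objective with boost_from_average enabled, no sample weights, and lambda_l2=0. Under those assumptions the initial score is the log-odds of the positive rate, each row's gradient is p - y, each row's hessian is p * (1 - p) instead of 1, and the sigmoid is applied only at the end. The helper names `sigmoid` and `leaf_scores` and the toy data are hypothetical and not part of LightGBM.

```python
import math


def sigmoid(x):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def leaf_scores(labels, leaves, learning_rate=0.1, lambda_l2=0.0):
    """Hand-compute {leaf: (raw_score, probability)} for a one-tree binary model.

    Assumption: default binary log-loss, boost_from_average=True, no weights.
    """
    # Initial prediction: log-odds of the overall positive rate.
    p0 = sum(labels) / len(labels)
    init_score = math.log(p0 / (1.0 - p0))

    result = {}
    for leaf in sorted(set(leaves)):
        ys = [y for y, lf in zip(labels, leaves) if lf == leaf]
        # Log-loss derivatives with respect to the raw score, evaluated at p0.
        sum_grad = sum(p0 - y for y in ys)
        sum_hess = len(ys) * p0 * (1.0 - p0)
        # Newton step for the leaf, shrunk by the learning rate.
        raw = init_score - learning_rate * sum_grad / (sum_hess + lambda_l2)
        result[leaf] = (raw, sigmoid(raw))
    return result


# Toy data (hypothetical): 8 rows, two leaves, half of the labels positive.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_leaves = [0, 0, 0, 0, 1, 1, 1, 1]
toy_result = leaf_scores(toy_labels, toy_leaves, learning_rate=0.1)
print(toy_result)
```

With the data above it could be called as leaf_scores(df_pred['label'].tolist(), df_pred['leaf'].tolist(), model.learning_rate). The score_raw and score_proba columns in the output above are at least consistent with the final sigmoid step: sigmoid(0.275629) ≈ 0.568474 and sigmoid(0.641339) ≈ 0.655056. That suggests the gap may come from working in probability space rather than from C++ versus Python.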

pira___