
I ran an XGBoost model with only one tree and dumped the model:

booster[0]:
0:[worst_area<884.549988] yes=1,no=2,missing=1,gain=278.707367,cover=106.5
    1:[worst_concave_points<0.135800004] yes=3,no=4,missing=3,gain=32.1795197,cover=71.5
        3:[mean_area<696.25] yes=7,no=8,missing=7,gain=3.56977844,cover=62.75
            7:leaf=0.0952191278,cover=61.75
            8:leaf=-0,cover=1
        4:[mean_texture<19.7099991] yes=9,no=10,missing=9,gain=13.5565615,cover=8.75
            9:leaf=0.0384615399,cover=5.5
            10:leaf=-0.0764705911,cover=3.25
    2:[mean_concavity<0.0721400008] yes=5,no=6,missing=5,gain=9.40318298,cover=35
        5:[mean_texture<19.5449982] yes=11,no=12,missing=11,gain=5.81390381,cover=3.25
            11:leaf=0.0454545468,cover=1.75
            12:leaf=-0.0600000024,cover=1.5
        6:leaf=-0.0969465673,cover=31.75

I supposed that leaf = some value, and that this value is the predicted probability of that leaf. In the tree above, this value can only be one of the following (all with absolute value < 0.1):

            7:leaf=0.0952191278,cover=61.75
            8:leaf=-0,cover=1
            9:leaf=0.0384615399,cover=5.5
            10:leaf=-0.0764705911,cover=3.25
            11:leaf=0.0454545468,cover=1.75
            12:leaf=-0.0600000024,cover=1.5
            6:leaf=-0.0969465673,cover=31.75

But the predictions on the training/test data show different values. Why is this?

Code:

import pandas as pd

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, ParameterGrid

from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# underscore the feature names so they match names like worst_area in the dump
X = pd.DataFrame(breast_cancer.data, columns=pd.Series(breast_cancer.feature_names).str.replace(' ', '_'))
y = pd.Series(breast_cancer.target)


# hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


# a single tree (n_estimators=1) of depth 3 keeps the dump small
params = {'learning_rate': [0.05], 'max_depth': [3], 'n_estimators': [1]}
param_grid = list(ParameterGrid(params))

xgb_model = xgb.XGBClassifier(**param_grid[0])
xgb_model = xgb_model.fit(X_train, y_train)


# class predictions and accuracy on the test set
y_test_hat = xgb_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_hat)


# probability of the positive class for the first ten training rows
prob = xgb_model.predict_proba(X_train)
prob = pd.Series(prob[:, 1], name='prob')
print(prob[0:10])

# same for the first ten test rows
prob = xgb_model.predict_proba(X_test)
prob = pd.Series(prob[:, 1], name='prob')
print(prob[0:10])


xgb_model.get_booster().dump_model('xgb_model.txt', with_stats=True) # see model in the file
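
To relate the dump to individual predictions, the booster can also report which leaf each row ends up in (a small sketch using the pred_leaf option of Booster.predict):

leaf_idx = xgb_model.get_booster().predict(xgb.DMatrix(X_test), pred_leaf=True)
print(leaf_idx[0:10])  # node id of the leaf each test row reaches, e.g. 7 or 6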
  • Out of curiosity, what leads you to believe that the `leaf=` gives you the probability of each leaf, independent of other inputs? Is this somewhere in the documentation? – G. Anderson May 31 '19 at 16:43
  • Oh, I saw the answer there. leaf = the z value, not the prob. – YJZ May 31 '19 at 17:15
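
To make that concrete: a minimal sketch, assuming the default base_score=0.5 (whose log-odds offset is logit(0.5) = 0), showing that with a single tree predict_proba is just the sigmoid of the leaf value:

import numpy as np
import xgboost as xgb

# raw, untransformed scores: logit(base_score) + the leaf value of the single tree
margin = xgb_model.get_booster().predict(xgb.DMatrix(X_test), output_margin=True)

# apply the logistic function by hand and compare with predict_proba
prob_manual = 1.0 / (1.0 + np.exp(-margin))
prob_api = xgb_model.predict_proba(X_test)[:, 1]
print(np.allclose(prob_manual, prob_api))  # True

For example, sigmoid(0.0952191278) ≈ 0.5238, which is the probability returned for every row that lands in leaf 7; the raw leaf value itself never appears in the output.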
