
I have fitted an XGBoost model for binary classification, and I am trying to understand the fitted model by using SHAP to explain its predictions.

However, I am confused by the force plot generated by SHAP. I expected the output value to be smaller than 0, since the predicted probability is less than 0.5. Instead, the force plot shows an output value of 8.12.
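
My expectation comes from the fact that, for a binary:logistic model, SHAP values are expressed in log-odds (margin) space, so the force plot's output value (base value plus the sum of the SHAP values) should equal the logit of the predicted probability. A quick sketch of what I mean, using the predicted probability 0.2292 shown further below:

from scipy.special import logit

# For a predicted probability p < 0.5, the raw margin logit(p) is negative,
# so I expect the force plot's output value to be negative as well.
print(logit(0.2292))  # ≈ -1.21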

Below is the code I used to generate the result.

import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz

print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))

Version of SHAP: 0.39.0

Version of XGBoost: 1.4.1

# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)

# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
    
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]

# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')

# Model prediction

model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape

(7887, 501)

# Randomly select a case
xid = 4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))

Predict proba: 0.2292
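
As a sanity check (my own addition, not part of the original run): with pred_contribs=True the last column is the bias term, which is why the array has 501 columns for 500 features, and each row should sum to the raw margin, so applying the sigmoid to that sum should recover the predicted probability:

from scipy.special import expit

margin = model_pred_detail[xid].sum()  # per-feature contributions + bias term
print('Margin: {:.4f}'.format(margin))              # should be negative here
print('Probability: {:.4f}'.format(expit(margin)))  # should print 0.2292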

# Compute SHAP values with the shap package (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())

shap.plots.force(shap_values[xid])

[Force plot from shap.plots.force: the output value shown is 8.12]
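
For reference, the output value shown in the force plot can be recomputed from the Explanation object (a quick check of mine, assuming the usual values and base_values attributes):

# base value + sum of SHAP values = the force plot's output value (8.12 here)
print(shap_values[xid].base_values + shap_values[xid].values.sum())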

However, I get a different plot if I use the SHAP values returned by the XGBoost library, and it looks much closer to my expectation.

shap.force_plot(
    model_pred_detail[xid, -1],   # bias term from predict with pred_contribs=True
    model_pred_detail[xid, 0:-1], # per-feature contributions from the same call
    feature_names=feature_names,
    features=X[xid].toarray()
)

[Force plot from shap.force_plot: the output value is negative, around -1.21, as expected]
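
To compare the two attributions for the same case directly, a numerical check like the following (my own diagnostic sketch) should show where they diverge:

import numpy as np

# If the two paths agreed, the per-feature values and the base/bias terms
# would match element-wise; here they apparently do not.
print(np.allclose(shap_values[xid].values, model_pred_detail[xid, :-1]))
print(shap_values[xid].base_values, model_pred_detail[xid, -1])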

Why does this happen? Which one is the correct SHAP value to explain the XGBoost model?

Thank you for your help.

Follow-up to the reply from @sergey-bushmanov

Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.

Here is the code for model training:


import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import io
from scipy.sparse import save_npz


# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20

# Re-decode the raw bytes to fix an encoding problem in the file
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')

# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)

df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)

# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                             ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)

last_params = {
 'lambda': 0.00016096144192346114,
 'alpha': 0.057770973181367063,
 'eta': 0.19258319097144733,
 'gamma': 0.40032424821976653,
 'max_depth': 9,
 'min_child_weight': 5,
 'subsample': 0.31304772813494836,
 'colsample_bytree': 0.4214452441229869,
 'objective': 'binary:logistic',
 'verbosity': 0,
 'n_estimators': 400
}

classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
# w_train was undefined in the original snippet; assumed to weight the negative class by class_weight
w_train = np.where(y_train == 0, class_weight, 1.0)
classifierCV.fit(X_train, y_train, sample_weight=w_train)

# Get the features
feature_names = vectorizer.get_feature_names()

# save model
classifierCV.get_booster().save_model('xgboost.json')

# Save features
import json

with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y:x for x, y in enumerate(feature_names)}))

# save data
save_npz('test_data.npz', X_train)
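
A round-trip check I added while preparing this example: the booster reloaded from xgboost.json should reproduce the classifier's predicted probabilities on the training matrix:

booster = xgb.Booster()
booster.load_model('xgboost.json')
p_booster = booster.predict(xgb.DMatrix(X_train))
p_clf = classifierCV.predict_proba(X_train)[:, 1]
print(np.allclose(p_booster, p_clf, atol=1e-6))  # expected: True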

The problem is still present with this model.

– Felix Chan
  • Can you put up a complete start-to-end reproducible example, including data, and showing what "problem" you have? What I see so far is a compilation of two cases: one without data and with a problem, and another with data but without specifying what exact problem needs to be solved. – Sergey Bushmanov Nov 15 '21 at 11:19
  • Sorry for the late reply. I have put the complete notebook [here](https://www.kaggle.com/felixchan/strange-behavior-shap/notebook) for your reference. Thank you. – Felix Chan Nov 25 '21 at 09:22

2 Answers


Which one is the correct SHAP value to explain the XGBoost model?

Let's guess that you have a binary classification problem at hand. Then what you're getting in your second example is indeed the right decomposition of raw SHAP values:

In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813 

Note that .2297 is close to what you see in your:

Predict proba: 0.2292

As for:

Why does this happen?

most probably you have a typo somewhere, but to be sure you'd have to provide a fully reproducible example, including your data, because code-wise both ways of calculating SHAP values are correct.

– Sergey Bushmanov
  • Thank you, Sergey. I cannot share my own dataset, so I reproduced the situation with an open [dataset](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification/version/1?select=Corona_NLP_train.csv). – Felix Chan Nov 15 '21 at 09:21

I find that the implementation in XGBoost should be similar to TreeSHAP (Slundberg also contributed to it). I have tested it and see similar results. You can reproduce it here: Reproduce results.

Note that the approx_contribs argument is recommended to be set to False.
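
For example, a minimal sketch of the call, reusing the Booster and DMatrix from the question:

# Exact (non-approximate) TreeSHAP contributions from XGBoost;
# approx_contribs=False is the default.
contribs = xgb_model.predict(X_dmatrix, pred_contribs=True, approx_contribs=False)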
