I have fitted an XGBoost model for binary classification, and I am trying to understand the fitted model by using SHAP to explain its predictions.
However, I am confused by the force plot generated by SHAP. I expected the output value to be smaller than 0, since the predicted probability is less than 0.5, but the force plot shows 8.12.
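To make my expectation concrete: my understanding is that for a binary:logistic objective the raw (margin) output is in log-odds space, so a probability below 0.5 should correspond to a negative value. A minimal sketch of the check I had in mind (the scipy.special helpers are only for illustration):
from scipy.special import expit, logit  # expit = sigmoid, logit = its inverse
p = 0.2292               # predicted probability for my case (see below)
print(logit(p))          # log-odds, roughly -1.21 -- negative because p < 0.5
print(expit(logit(p)))   # maps back to 0.2292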
Below is the code I use to generate the result.
import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz
print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))
Version of SHAP: 0.39.0
Version of XGBoost: 1.4.1
# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)
# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]
# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')
# Model prediction
model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape
(7887, 501)
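(Per the XGBoost documentation, with pred_contribs=True the output has one extra column: the last column is the bias term, and each row sums to the raw margin prediction. So the 501 columns here are my 500 features plus the bias.)
# Last column is the bias term; the rest are per-feature contributions
bias = model_pred_detail[:, -1]
contribs = model_pred_detail[:, :-1]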
# Randomly select a case
xid = 4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))
Predict proba: 0.2292
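If my reading is right, the raw margin for this case should be logit(0.2292) ≈ -1.21. This check is my own addition; output_margin=True asks Booster.predict for the untransformed log-odds:
# Raw (log-odds) prediction for the same case; expected around -1.21
model_pred_margin = xgb_model.predict(X_dmatrix, output_margin=True)
print('Predict margin: {:.04f}'.format(model_pred_margin[xid]))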
# Using the shap package directly (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())
shap.plots.force(shap_values[xid])
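As a sanity check (my own addition), the Explanation object should satisfy additivity: the base value plus the sum of the SHAP values should reproduce the raw margin:
# Additivity check: base_values + sum(values) should equal the raw margin
print(shap_values[xid].base_values + shap_values[xid].values.sum())
print(xgb_model.predict(X_dmatrix, output_margin=True)[xid])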
However, I get a different plot, which looks closer to my expectation, if I use the SHAP values computed by the XGBoost library itself:
shap.force_plot(
    model_pred_detail[xid, -1],    # bias term from Booster.predict(..., pred_contribs=True)
    model_pred_detail[xid, 0:-1],  # per-feature contributions from the same call
    feature_names=feature_names,
    features=X[xid].toarray()
)
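To pin down where the two explanations diverge, I also compared the two sets of contributions directly (again my own addition):
# Do XGBoost's pred_contribs values match the shap explainer's values?
import numpy as np
print(np.allclose(model_pred_detail[xid, :-1], shap_values[xid].values))
print(model_pred_detail[xid, -1], shap_values[xid].base_values)  # bias vs. base value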
Why does this happen? Which set of SHAP values is the correct one for explaining the XGBoost model?
Thank you for your help.
Follow-up to the reply from @sergey-bushmanov:
Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.
Here is the code for model training:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz
# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20
# Decode manually to work around an encoding problem in the file
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')
# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)
df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)
# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                             ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)
last_params = {
'lambda': 0.00016096144192346114,
'alpha': 0.057770973181367063,
'eta': 0.19258319097144733,
'gamma': 0.40032424821976653,
'max_depth': 9,
'min_child_weight': 5,
'subsample': 0.31304772813494836,
'colsample_bytree': 0.4214452441229869,
'objective': 'binary:logistic',
'verbosity': 0,
'n_estimators': 400
}
# NOTE: w_train was missing from my original snippet; I derive it from class_weight
# (assumption: the negative class, label 0, is up-weighted by class_weight)
w_train = np.where(y_train == 0, class_weight, 1.0)
classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
classifierCV.fit(X_train, y_train, sample_weight=w_train)
# Get the features
feature_names = vectorizer.get_feature_names()
# save model
classifierCV.get_booster().save_model('xgboost.json')
# Save features
import json
with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y: x for x, y in enumerate(feature_names)}))
# save data
save_npz('test_data.npz', X_train)
The same problem occurs with this model.