
I am facing an error with the Python SHAP library. While creating force plots based on log odds is no problem, I am not able to create force plots based on probabilities. The goal is to have base_values and shap_values that sum up to the predicted probability.

This works:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
import sklearn
import shap

X, y = shap.datasets.iris()
X_display, y_display = shap.datasets.iris(display=True)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.2, random_state = 42)

#fit xgboost model
params = {
    'objective': "multi:softprob",
    'eval_metric': "mlogloss",
    'num_class': 3
}

xgb_fit = xgb.train(
    params=params,
    dtrain=xgb.DMatrix(data=X_train, label=y_train)
)

#create shap values and perform tests
explainer = shap.TreeExplainer(xgb_fit)
shap_values = explainer.shap_values(X_train)
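As a quick sanity check in log-odds space, the base values plus the per-feature SHAP values should reproduce the booster's raw margins. A minimal sketch of that check, assuming shap's list-per-class output for multiclass XGBoost (the tolerance is an assumption):

#raw per-class margins (log odds) from the booster
margins = xgb_fit.predict(xgb.DMatrix(X_train), output_margin=True)

#for each class: base value + summed SHAP values should equal the raw margin
for k in range(3):
    reconstructed = explainer.expected_value[k] + shap_values[k].sum(axis=1)
    assert np.allclose(reconstructed, margins[:, k], atol=1e-4)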

And this does not work:

explainer = shap.TreeExplainer(
    model=xgb_fit,
    data=X_train,
    feature_perturbation='interventional',
    model_output='probability'
)

(screenshot of the error traceback raised by this call)

Used packages:

matplotlib 3.4.1
numpy 1.20.2
pandas 1.2.4
scikit-learn 0.24.1
shap 0.39.0
xgboost 1.4.1


1 Answer


To see how your raw scores for multiclass classification add up in probability space, try KernelExplainer:

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from shap import datasets, KernelExplainer, force_plot, initjs
from scipy.special import softmax, expit  # not used below; handy if you convert raw scores by hand

initjs()

X, y = datasets.iris()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = XGBClassifier(random_state=42,
                    eval_metric="mlogloss",
                    use_label_encoder=False)
clf.fit(X_train, y_train)

# KernelExplainer is model-agnostic: explaining predict_proba directly
# yields SHAP values that live in probability space
ke = KernelExplainer(clf.predict_proba, data=X_train)
shap_values = ke.shap_values(X_test)

# force plot for class 1, first test sample
force_plot(ke.expected_value[1], shap_values[1][0], feature_names=X.columns)

(force plot for class 1, first test sample)

Sanity check:

1. Expected result (up to a rounding error):

clf.predict_proba(X_test[:1])
#array([[0.0031177 , 0.9867134 , 0.01016894]], dtype=float32)

2. Base values:

clf.predict_proba(X_train).mean(0)
#array([0.3339472 , 0.34133017, 0.32472247], dtype=float32)

(or, if you wish, np.unique(y_train, return_counts=True)[1] / len(y_train))
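Since KernelExplainer satisfies the local accuracy property, the base value plus the per-feature SHAP values should reconstruct predict_proba for every class and every row, up to the sampling approximation. A minimal sketch of that check (the tolerance is an assumption, not a guarantee):

import numpy as np

proba = clf.predict_proba(X_test)
for k in range(3):
    #base value + summed SHAP values should be close to the predicted probability for class k
    reconstructed = ke.expected_value[k] + shap_values[k].sum(axis=1)
    assert np.allclose(reconstructed, proba[:, k], atol=1e-2)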

– Sergey Bushmanov
  • Awesome, this works! Do you also know how to make this work using xgb.train()? – user9898927 Apr 26 '21 at 12:10
  • Ideally, this should work the way you showed in the question under `And this does not work` (I would personally share your expectation that it works that way). Unfortunately it doesn't, and judging by the GitHub issues it hasn't been working for the last several months. So my quick answer: I don't know, as far as `force_plot` is concerned. Though, depending on your needs, you may try tinkering with the raw scores and converting them to probabilities via a softmax function (sketched below, after the comments). – Sergey Bushmanov Apr 26 '21 at 12:17
  • Thanks, then I will build my own workaround based on your example and hope for a future fix. Maybe in the meantime I will have a deeper look into the source code (https://github.com/slundberg/shap/blob/master/shap/explainers/_tree.py) to evaluate which part of the code causes the error described above. One additional point: your example raises the following warning: "Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption". For the iris dataset this is no problem, but unfortunately for my data it is (see the background-summarization note below). – user9898927 Apr 26 '21 at 12:53
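Below is a minimal sketch of the raw-score route mentioned in the comments above. It assumes the Booster (xgb_fit), TreeExplainer, and log-odds shap_values from the question; the per-class totals are pushed through softmax to recover the multi:softprob probabilities. Note that this converts the totals only: the individual per-feature contributions do not map linearly into probability space.

import numpy as np
from scipy.special import softmax

#total raw (log-odds) prediction per class: base value + summed SHAP values
raw = np.stack(
    [explainer.expected_value[k] + shap_values[k].sum(axis=1) for k in range(3)],
    axis=1
)

#softmax over classes recovers the multi:softprob probabilities
proba = softmax(raw, axis=1)

As for the memory warning in the last comment, the usual remedy with KernelExplainer is to summarize the background data instead of passing the full training set, e.g. KernelExplainer(clf.predict_proba, shap.kmeans(X_train, 50)).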