
I am working on a binary classification problem using random forest and trying out SHAP to explain the model's predictions.

However, I would like to convert the values shown in the SHAP local explanation plots into a pandas DataFrame, one per instance.

Is there anyone here who can help me with exporting SHAP local explanations to a pandas DataFrame for each instance?

I know that SHAPASH has a .to_pandas() method, but I couldn't find anything like that in SHAP.

I tried something like the code below, based on the SO post here, but it doesn't help:

import numpy as np
import pandas as pd

feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)

# mean of absolute SHAP values -> global feature importance, not per-instance contributions
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)

I expect my output to look something like the table below, where a negative sign indicates a feature contribution towards class 0 and a positive value indicates a contribution towards class 1:

subject_id    Feature importance    value (contribution)
         1    F1                     31
         1    F2                     27
         1    F3                     20
         1    F5                    -10
         1    F9                    -29
The Great

1 Answer


If you have a model like this:

import xgboost
import shap
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# train XGBoost model
X,y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)

# explain the model's predictions using SHAP values
# (same syntax works for LightGBM, CatBoost, and scikit-learn models)
background = shap.maskers.Independent(X, max_samples=100)
explainer = shap.Explainer(model, background, algorithm="tree")
sv = explainer(X)

you can decompose your results like this:

sv.base_values[0]

22.342787810446044

sv.values[0]

array([-7.68297079e-01, -4.38205232e-02,  3.46814548e-01, -4.06731364e-03,
       -3.17875379e-01, -5.37296545e-01,  2.68567768e-01, -1.30198611e+00,
       -4.83524088e-01, -4.39375216e-01,  2.94188969e-01,  2.43096180e-02,
        4.63890554e+00])

model.predict(X.iloc[[0]])

array([24.019339], dtype=float32)

Which is equal (up to float32 precision) to:

sv.base_values[0] + sum(sv.values[0])

24.01933200249436
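
If you want to check that this additivity holds for every row at once (not just row 0), here is a minimal sketch reusing `sv`, `model`, and `X` from above; the variable name `reconstructed` and the tolerance are just illustrative:

# base value + sum of per-feature SHAP values should reproduce the raw model output
reconstructed = sv.base_values + sv.values.sum(axis=1)

# model.predict returns float32, so allow a small tolerance; should print True
print(np.allclose(reconstructed, model.predict(X), atol=1e-4))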

If you want to put the results into a pandas DataFrame:

pd.DataFrame(sv.values[0], index = X.columns)

         0
CRIM    -0.768297
ZN      -0.043821
INDUS    0.346815
CHAS    -0.004067
NOX     -0.317875
RM      -0.537297
AGE      0.268568
DIS     -1.301986
RAD     -0.483524
TAX     -0.439375
PTRATIO  0.294189
B        0.024310
LSTAT    4.638906

Alternatively, if you want everything arranged row-wise (one row per instance):

pd.DataFrame(
    np.c_[sv.base_values, sv.values],
    columns = ["bv"] + list(X.columns)
)
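
And if you want the long, one-row-per-feature layout from the question (subject_id, feature, contribution), here is a minimal sketch built on top of `sv` and `X`; the frame and column names (`wide`, `long_df`, "subject_id", "feature", "contribution") are just illustrative:

# wide frame: one row per instance, one column per feature
wide = pd.DataFrame(sv.values, columns=X.columns)
wide.index.name = "subject_id"

# melt to long format and sort each instance's contributions, largest first
long_df = (
    wide.reset_index()
        .melt(id_vars="subject_id", var_name="feature", value_name="contribution")
        .sort_values(["subject_id", "contribution"], ascending=[True, False])
)
print(long_df.head())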
Sergey Bushmanov
  • Is this for a global explanation? I want to know how we can export the local explanation. Upvoted for the help – The Great Mar 24 '22 at 23:41
  • This is local, in the sense that point `0` is explained. You'll get a global explanation if you aggregate sv by means of mean(abs) or similar. – Sergey Bushmanov Mar 25 '22 at 03:06
  • What does `background = shap.maskers.Independent(X, max_samples=100)` mean? – The Great Mar 25 '22 at 03:17
  • Your answer is very detailed and useful. If you could just help me understand what you mean by `Independent`, `maskers` and `max_samples`, that would really be useful – The Great Mar 25 '22 at 03:18
  • SHAP is computationally intensive in general, which is one of the reasons the mask sampler was introduced. `max_samples` is a sampling parameter; in general 100 is believed to be good enough, and that is what will be used if you supply the background data directly (without warning you). If you want to manage this figure yourself, you pass the data through a masker. – Sergey Bushmanov Mar 25 '22 at 03:27
  • As far as the masker type is concerned, your model in general is only as good as the data collected. To try to overcome this, SHAP introduces 2 types of maskers (data samplers, if you remember): independent and interventional. If this is interesting to you, I advise googling "true to model or true to data" and coming back with another question if something is not clear. – Sergey Bushmanov Mar 25 '22 at 03:30
  • See update for all locals in one df – Sergey Bushmanov Mar 25 '22 at 03:31
  • And `background` means which data is used to infer/calculate the SHAP values – Sergey Bushmanov Mar 25 '22 at 03:40
  • So, `max_samples` means it uses the neighbouring 100 or 500 samples to calculate the Shapley values, instead of the full dataset? – The Great Mar 25 '22 at 03:51
  • No, it uses 100 data points as is and averages marginal feature contributions over that sample. The notion of a neighbourhood -- used in LIME -- is not applicable to SHAP. – Sergey Bushmanov Mar 25 '22 at 03:54
  • Ah fantastic, understood, you are awesome. It just randomly samples 100 data points from our dataset and uses that to calculate the feature contributions. So, do you think 100 data points is enough to generalize to the full dataset? Of course, I know we can increase it to 500 or 1000 as well. Is there any intuition for choosing the right number? Is it purely computational? With more computational power, could I choose the full dataset as well? – The Great Mar 25 '22 at 03:56
  • Good question! Why not ask it as a question? ;) – Sergey Bushmanov Mar 25 '22 at 03:58
  • Your aim here is to understand which features, on average, move the outcome in which direction. You may use the whole dataset by feeding it through the masker and specifying `max_samples = X.shape[0]`, but to speed things up 100 is believed to be enough (which is done silently if you supply the background as `X`, without a masker). Why is 100 enough? Because the estimates converge (as usual, due to the CLT). – Sergey Bushmanov Mar 26 '22 at 04:46
  • Is the 1st line in your response addressing my question on "SHAP value range"? Do you mean that we don't need to worry about the output range/feature contribution range, or that there is no such thing as a range at all? – The Great Mar 26 '22 at 06:38
  • I'm answering your question about whether we should be more precise about the SHAP values and whether it wouldn't be better to use 1,000 instead of 100 – Sergey Bushmanov Mar 26 '22 at 06:54