I think your question already contains a hint:
explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)
is expensive because it treats the model as a black-box function, and most probably runs an exact (brute-force) algorithm to calculate Shapley values from that function.
explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)
averages readily available predictions from the trained model, so the calculation is cheap.
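A quick way to see the difference is to check which concrete explainer class each call dispatches to. This is only a sketch: the scikit-learn model and synthetic data below are my own illustration, not from the question.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Passing the model object dispatches to a fast, model-specific explainer.
print(type(shap.Explainer(model, X)))          # e.g. shap.explainers.Tree

# Passing the predict function dispatches to a model-agnostic explainer.
print(type(shap.Explainer(model.predict, X)))  # e.g. shap.explainers.Exact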
To back up the first claim (the second is simply how model-specific explainers work), let's study the source code of the Explainer class.
Class definition:
class Explainer(Serializable):
    """ Uses Shapley values to explain any machine learning model or python function.

    This is the primary explainer interface for the SHAP library. It takes any combination
    of a model and masker and returns a callable subclass object that implements
    the particular estimation algorithm that was chosen.
    """

    def __init__(self, model, masker=None, link=links.identity, algorithm="auto", output_names=None,
                 feature_names=None, linearize_link=True, seed=None, **kwargs):
        """ Build a new explainer for the passed model.

        Parameters
        ----------
        model : object or function
            User supplied function or model object that takes a dataset of samples and
            computes the output of the model for those samples.
So now you know you can provide either a model or a function as the first argument.
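To illustrate: any callable that maps a batch of samples to outputs is accepted, not just a fitted model. The toy function below is hypothetical, purely for demonstration.
import numpy as np
import shap

def f(X):
    # a made-up scoring function; any samples -> outputs callable works
    return X[:, 0] ** 2 + np.sin(X[:, 1])

background = np.random.RandomState(0).randn(100, 2)
explainer = shap.Explainer(f, background)
print(explainer(background[:5]).values.shape)  # (5, 2)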
If a Pandas DataFrame (or a 2-D array) is supplied as the masker:
if safe_isinstance(masker, "pandas.core.frame.DataFrame") or \
        ((safe_isinstance(masker, "numpy.ndarray") or sp.sparse.issparse(masker)) and len(masker.shape) == 2):
    if algorithm == "partition":
        self.masker = maskers.Partition(masker)
    else:
        self.masker = maskers.Independent(masker)
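So passing X_train directly should be equivalent to wrapping it in an Independent masker yourself (a sketch, reusing the names from your question):
import shap

masker = shap.maskers.Independent(X_train)        # what Explainer builds for you
explainer = shap.Explainer(model.predict, masker)
# expected to behave the same as shap.Explainer(model.predict, X_train)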
Finally, if a callable is supplied:
elif callable(self.model):
    if issubclass(type(self.masker), maskers.Independent):
        if self.masker.shape[1] <= 10:
            algorithm = "exact"
        else:
            algorithm = "permutation"
Hopefully you see now why the first variant ends up with the exact algorithm (for up to 10 features; above that it falls back to permutation), and why it takes so long to calculate either way.
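If you don't want to depend on this "auto" dispatch, you can request an algorithm explicitly; for example, forcing the sampling-based estimator (a sketch, reusing your variable names):
import shap

explainer = shap.Explainer(model.predict, X_train, algorithm="permutation")
shap_values = explainer(X_test).values  # Explanation object; .values holds the array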
Now to your question(s):
What is the correct way to obtain explanations for predictions using Shap?
and
So that leaves me to wonder what I'm actually getting in the second case?
If you have a model (tree, linear, whatever) that is supported by SHAP, use:
explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)
These are SHAP values extracted from the model itself, which is exactly the use case SHAP was designed for.
If your model is not supported, fall back to the first, model-agnostic variant.
Both should give similar results.
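You can verify that numerically. A sketch, reusing your variable names; the printed difference is expected to be small but not exactly zero, since the model-agnostic estimate is approximate for larger feature counts:
import numpy as np
import shap

sv_model = shap.Explainer(model, X_train)(X_test).values
sv_fn = shap.Explainer(model.predict, X_train)(X_test).values
print(np.abs(sv_model - sv_fn).max())  # expect a small difference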