I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from how to extract the most important feature names? and How to get feature names of shap_values from TreeExplainer? specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large, but, creating the vals
variable runs very slowly, about 58 minutes on my laptop. It uses almost all RAM on the computer.
After 58 minutes I get an error:
Command terminated by signal 9
which as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(list(zip(X_train.columns,np.abs(shap_values2).mean(0))),columns=['col_name','feature_importance_vals'])
so that vals
isn't stored but this change doesn't reduce RAM at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
""" Return a dataframe containing the features sorted by Shap importance
Parameters
----------
model : The tree-based model
X : pd.Dataframe
training set/test set/the whole dataset ... (without the label)
Returns
-------
pd.Dataframe
A dataframe containing the features sorted by Shap importance
"""
explainer = shap.Explainer(model)
shap_values = explainer(X)
cohorts = {"": shap_values}
cohort_labels = list(cohorts.keys())
cohort_exps = list(cohorts.values())
for i in range(len(cohort_exps)):
if len(cohort_exps[i].shape) == 2:
cohort_exps[i] = cohort_exps[i].abs.mean(0)
features = cohort_exps[0].data
feature_names = cohort_exps[0].feature_names
values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
feature_importance = pd.DataFrame(
list(zip(feature_names, sum(values))), columns=['features', 'importance'])
feature_importance.sort_values(
by=['importance'], ascending=False, inplace=True)
return feature_importance
but global_shap_importance
returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance
so that the features are returned in the same order as summary_plot
(beeswarm plot).
How can I get the feature importance ranking into a dataframe?