
I am attempting to get shap values out of an array which was created by

explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)

using my XGBoost data, to make a dataframe of feature_names and their SHAP importance values, as they would appear in a SHAP bar or summary plot.

Following advice from "how to extract the most important feature names?" and "How to get feature names of shap_values from TreeExplainer?", specifically the comment by user Thoo, which shows how the values can be extracted into a dataframe:

vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()

shap_values2 covers 11595 persons with 595 features each, which I understand is large, but creating the vals variable runs very slowly: about 58 minutes on my laptop, and it uses almost all of the RAM on the machine.

After 58 minutes I get an error: Command terminated by signal 9

which, as far as I understand, means the process was killed because the computer ran out of RAM (signal 9 is SIGKILL, typically sent by the out-of-memory killer).

I've tried converting the second line of Thoo's code to

feature_importance = pd.DataFrame(list(zip(X_train.columns, np.abs(shap_values2).mean(0))), columns=['col_name', 'feature_importance_vals'])

so that vals isn't stored, but this change doesn't reduce RAM usage at all.
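
I'm wondering whether computing the mean in chunks would bound the peak memory instead. A sketch, assuming shap_values2.values exposes the raw (11595, 595) NumPy array (for multiclass output the array would be 3-D and need an extra axis):

import numpy as np

# Accumulate the per-feature sum of |SHAP| in row chunks, so a full-size
# np.abs(...) temporary is never materialized all at once.
raw = shap_values2.values                 # raw (n_samples, n_features) array
abs_sum = np.zeros(raw.shape[1])
chunk = 1000                              # rows per chunk; tune to available RAM
for start in range(0, raw.shape[0], chunk):
    abs_sum += np.abs(raw[start:start + chunk]).sum(axis=0)
vals = abs_sum / raw.shape[0]             # mean |SHAP| per feature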

I've also tried a different comment from the same GitHub issue (user "ba1mn"):

def global_shap_importance(model, X):
    """ Return a dataframe containing the features sorted by Shap importance
    Parameters
    ----------
    model : The tree-based model 
    X : pd.Dataframe
         training set/test set/the whole dataset ... (without the label)
    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance

but global_shap_importance returns the feature importances in the wrong order, and I don't see how to alter it so that the features come back in the same order as in summary_plot (the beeswarm plot).

How can I get the feature importance ranking into a dataframe?


2 Answers


I pulled this straight from the SHAP source code. Confirmed identical to the summary_plot ordering.

import numpy as np
import pandas as pd

def shapley_feature_ranking(shap_values, X):
    # Mean |SHAP| per feature -- the statistic summary_plot uses for its ordering
    mean_abs = np.mean(np.abs(shap_values), axis=0)
    feature_order = np.argsort(mean_abs)  # ascending, as in the SHAP source
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": mean_abs[feature_order][::-1],
        }
    )
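
For example, with the Explanation object from the question you would pass the raw array (assuming .values holds the (n_samples, n_features) SHAP matrix):

ranking = shapley_feature_ranking(shap_values2.values, X_train)
ranking.head()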
  • If called as shapley_feature_ranking(shap_values, X) on the full multiclass output, the returned rows are garbled, e.g. `features importance 0 Index(['feature_271', 'feature_239', 'feature_... [[0.0025582758, 0.0012526097, 0.0028011668, 0.... 1 Index(['feature_144', 'feature_136', 'feature_... [[0.0024217064, 0.0016129484, 0.0066467705, 0....`; if called as shapley_feature_ranking(shap_values[0], X) it gives the wrong order. – MosQuan Aug 28 '23 at 11:38

Making more use of NumPy will save much of the computation time:

import numpy as np
import pandas as pd

def get_shap_ranking(shap_values,
                     X: pd.DataFrame) -> list:
    '''For multiclass: shap_values is a list of per-class (n_samples, n_features) arrays.'''
    # Stack the per-class arrays along a new last axis:
    # shape (n_samples, n_features, n_classes)
    shap_values_aggregated = np.stack(shap_values, axis=2)

    # Absolute sum across observations and classes for each feature -> shape (n_features,)
    abs_sum_per_feature = np.abs(shap_values_aggregated).sum(axis=(0, 2))

    # Map feature names to their absolute-sum values
    feature_abs_sum_dict = dict(zip(X.columns, abs_sum_per_feature))

    # Sort features by importance, descending
    sorted_features = sorted(feature_abs_sum_dict.items(), key=lambda x: x[1], reverse=True)

    return sorted_features

The order is the same as in the plot, and the values are proportional to the width of the corresponding rows in the plot.

You can then convert it to a dataframe with

pd.DataFrame(sorted_features, columns=["feature", "importance"])
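
A hypothetical end-to-end sketch; it assumes shap_values is the list of per-class arrays that shap.TreeExplainer(model).shap_values(X) returns for many multiclass models (newer SHAP versions may return a single 3-D array instead):

import shap

shap_values = shap.TreeExplainer(model).shap_values(X)  # list of per-class arrays (assumption)
sorted_features = get_shap_ranking(shap_values, X)
importance_df = pd.DataFrame(sorted_features, columns=["feature", "importance"])
importance_df.head()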