
I need to understand my LightGBM model better, so I am using SHAP's TreeExplainer. LightGBM needs the data to be encoded, and I am passing the same encoded data to the tree explainer. I am therefore worried that TreeExplainer and shap_values() are treating my data as numeric. How do I specify that the data is categorical? Does that change the SHAP value calculation?

I have already gone through the documentation.

Giorgos Myrianthous
sameershah141

1 Answer


shap cannot handle features of type object. Make sure that your continuous variables are of type float and your categorical variables are of type category:


for cont in continuous_variables:
    df[cont] = df[cont].astype('float64')

for cat in categorical_variables:
    df[cat] = df[cat].astype('category')
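
If you are not sure which columns are still stored as object, a quick check before the conversion can help. The following is only a minimal sketch on a made-up toy frame: the column names `age` and `city` are hypothetical, and the assumption that every object column should become a category may not hold for your data.

```python
import pandas as pd

# Hypothetical toy data; 'age' and 'city' are illustrative names only.
df = pd.DataFrame({
    "age": pd.Series([25, 32, 47], dtype="int64"),
    "city": pd.Series(["Paris", "Rome", "Paris"], dtype="object"),
})

# Find the columns that are still stored as generic Python objects.
object_columns = [col for col in df.columns if df[col].dtype == object]

# Assumption: every object column is a categorical feature.
for col in object_columns:
    df[col] = df[col].astype("category")

# Cast the remaining (continuous) columns to float64.
for col in df.columns.difference(object_columns):
    df[col] = df[col].astype("float64")

# No object columns should be left after the conversion.
remaining_object_columns = [col for col in df.columns if df[col].dtype == object]
```

After this runs, `df.dtypes` should show only float64 and category columns.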

Finally, make sure that the parameters you attach to the explainer's model name the same categorical columns in categorical_feature:

params = {
    'objective': "binary", 
    'num_leaves': 100, 
    'num_trees': 500, 
    'learning_rate': 0.1, 
    'tree_learner': 'data', 
    'device': 'cpu', 
    'seed': 132, 
    'max_depth': -1, 
    'min_data_in_leaf': 50, 
    'subsample': 0.9, 
    'feature_fraction': 1, 
    'metric': 'binary_logloss', 
    'categorical_feature': ['categoricalFeature1', 'categoricalFeature2']
}

import lightgbm as lgbm
import shap

# Load the trained booster and attach the parameters defined above
bst = lgbm.Booster(model_file='model_file.txt')
tree_explainer = shap.TreeExplainer(bst)
tree_explainer.model.original_model.params = params

shap_values_result = tree_explainer.shap_values(df[features], y=df[target])

Alternatively, you can label-encode your categorical features. For example:

df['categoricalFeature'] = df['categoricalFeature'].astype('category')
df['categoricalFeature'] = df['categoricalFeature'].cat.codes

Note that you must be able to reproduce this mapping, so that validation and test datasets are transformed in the same way.
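
One possible way to keep the mapping reproducible is to fix the category list once, from the training data, and reuse it for every other split. This is a sketch under assumptions, not the only approach: the `color` column, the toy values, and the `encode` helper are all hypothetical. With pandas, values that were not seen in training get the code -1.

```python
import pandas as pd

# Hypothetical toy splits; 'color' is an illustrative column name.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "red", "purple"]})

# Learn the category order once, from the training data only.
color_categories = sorted(train["color"].unique().tolist())

def encode(values, categories):
    """Map values to integer codes using a fixed category list.

    Values missing from `categories` are encoded as -1.
    """
    return pd.Categorical(values, categories=categories).codes

train["color_code"] = encode(train["color"], color_categories)
test["color_code"] = encode(test["color"], color_categories)
```

Storing `color_categories` alongside the model (for example, in a JSON file) lets you apply the same encoding later. How to treat the -1 codes for unseen values is a separate decision that depends on your model.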

Giorgos Myrianthous