I've been reading about SHAP plots and having a go at creating some for my models, but for the life of me I can't find a package that integrates with caret.

I just need to sense-check that the direction of each feature's effect is sensible.

Let's say I build a simple xgboost model like so:

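library(caret)

# xgboost via caret: 5-fold CV repeated twice, with centering, scaling,
# and zero-variance filtering applied to the predictors
model_1 <- train(
  sales ~ .,
  data = example_df,
  method = "xgbTree",
  preProcess = c("center", "scale", "zv"),
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
  na.action = na.omit
)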

I've done this a few times, along with some feature engineering and selection, and now want to see the SHAP values. How do I achieve this?

Quixotic22

1 Answer

The train object returned by `caret::train` contains an element `finalModel`, which is the fitted object produced by the underlying method. In this case it is an xgboost booster, so you can use the utilities in the xgboost package on it.

To plot SHAP values, call `xgb.plot.shap(data = X, model = model_1$finalModel, top_n = 15)`. Here `X` stands for your predictors as a numeric matrix (or `dgCMatrix`) whose columns match the features the model was trained on; a data frame such as `example_df` has to be converted first. To get the SHAP values themselves, call the same function with `plot = FALSE`, like this: `shap_values <- xgb.plot.shap(data = X, model = model_1$finalModel, top_n = 15, plot = FALSE)`.
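Putting the two calls together, a minimal sketch might look like the following. It assumes a made-up, all-numeric stand-in for `example_df` (so no encoding is needed), a single hypothetical tuning row to keep the fit quick, and no `preProcess` step; none of these come from the question.

```r
library(caret)
library(xgboost)

# Hypothetical stand-in for example_df: numeric predictors only.
set.seed(42)
n <- 200
example_df <- data.frame(
  price    = runif(n, 1, 10),
  promo    = rbinom(n, 1, 0.3),
  footfall = rnorm(n, 100, 15)
)
example_df$sales <- 50 - 3 * example_df$price + 8 * example_df$promo + rnorm(n)

# Numeric predictor matrix and response vector.
x <- as.matrix(example_df[, c("price", "promo", "footfall")])
y <- example_df$sales

# A single hypothetical tuning row so the example fits quickly.
grid <- expand.grid(
  nrounds = 50, max_depth = 3, eta = 0.1, gamma = 0,
  colsample_bytree = 1, min_child_weight = 1, subsample = 1
)

model_1 <- train(
  x = x, y = y,
  method = "xgbTree",
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = grid
)

# plot = FALSE returns the matrices without drawing anything.
shap_values <- xgb.plot.shap(
  data = x, model = model_1$finalModel, top_n = 3, plot = FALSE
)

# Same call with plotting enabled (the default).
xgb.plot.shap(data = x, model = model_1$finalModel, top_n = 3)
```

The returned list should hold the feature values in `shap_values$data` and the matching SHAP contributions in `shap_values$shap_contrib`. If the model was trained with `preProcess`, the matrix passed to `xgb.plot.shap()` would presumably need the same transformation applied first.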

Arthur
  • Thanks, knowing that will open a lot of doors. When I run the suggestion above, I get the following error: `data: must be either matrix or dgCMatrix`. So I naturally wrapped the df in `as.matrix`, but then I get this: `Error in xgb.DMatrix(newdata, missing = missing) : 'data' has class 'character' and length 35728. 'data' accepts either a numeric matrix or a single filename.` From there I tried one-hot encoding with `dummyVars`, assuming that would be the underlying method, but the names don't match those in the model. Any tips for this? – Quixotic22 Apr 04 '22 at 13:47
  • Follow-up: I have largely made it work like so: `df_dummy <- dummyVars('~.', data = example_df, sep ='') %>% predict(example_df) %>% as_tibble() %>% select(all_of(model_1$finalModel$feature_names)) %>% as.matrix()`. I'm sure there's a better way, but at least I've got something. – Quixotic22 Apr 04 '22 at 14:13
  • I think what's happening here is that caret builds the matrix input to xgboost internally. You might want to build the x matrix and y vector yourself with `model.matrix(formula, data)` so you can be sure you're sending exactly the same data to `train()` and `xgb.plot.shap()`. In other words, call `train(x, y)` instead of `train(formula, data)`. – Arthur Apr 04 '22 at 17:29
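The matrix-building step suggested in the last comment can be sketched in base R. The data frame below is a hypothetical four-row example with one categorical column, used only to show why `as.matrix()` fails and what `model.matrix()` produces instead.

```r
# Hypothetical data: one numeric and one categorical predictor.
example_df <- data.frame(
  sales  = c(10, 12, 9, 15),
  price  = c(2.5, 2.0, 3.0, 1.5),
  region = factor(c("north", "south", "north", "east"))
)

# as.matrix() on mixed column types coerces everything to character,
# which is what leads to the "'data' has class 'character'" error.
raw_matrix <- as.matrix(example_df)
is.character(raw_matrix)

# model.matrix() expands factors into numeric dummy columns;
# "- 1" drops the intercept column.
x <- model.matrix(sales ~ . - 1, data = example_df)
y <- example_df$sales

colnames(x)
is.numeric(x)
```

The same `x` and `y` can then be passed to `train(x = x, y = y, method = "xgbTree", ...)` and `x` reused as the `data` argument of `xgb.plot.shap()`, so the column names seen by the model and by the SHAP call are identical by construction.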