How can I determine variable importance (with the vip package in R) for categorical predictors once they have been one-hot encoded? It seems impossible to get an importance score for the original categorical predictor when the model is built on its dummy variables.
I will demonstrate what I mean with the Ames Housing dataset, using two categorical predictors: Street (two levels) and Sale.Type (ten levels). I converted them from characters to factors.
library(AmesHousing)
library(dplyr)  # %>% and mutate()
library(caret)  # train()
library(vip)    # vip()
df <- data.frame(ames_raw)
# convert characters to factors
df <- df %>% mutate(across(where(is.character), as.factor))
# Train/test split code from the DataCamp caret course
# Get the number of observations
n_obs <- nrow(df)
# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)
# Randomly order data:
df_shuffled <- df[permuted_rows, ]
# Identify row to split on: split
split <- round(n_obs * 0.7)
# Create train
train <- df_shuffled[1:split, ]
# Create test
test <- df_shuffled[(split + 1):n_obs, ]
# Fit a linear model on the training split only
mod_lm <- train(SalePrice ~ Street + Sale.Type,
                data = train,
                method = "lm")
vip(mod_lm)
The variable importance plot ranks each dummy level separately rather than the original predictor. I can see that StreetPave is important, but I cannot tell whether Street as a whole is important.
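To make concrete what a per-predictor score could look like, here is a minimal sketch of permutation importance computed on the original factor columns: each column is shuffled as a whole, so all of its levels contribute to a single number. It uses base R only and simulated data in place of the Ames columns; the names `dat`, `street`, `sale_type`, `price`, `rmse` and `perm_importance` are hypothetical and chosen for illustration. This is not how vip computes its default scores, only one way the idea might be approached.

```r
# Sketch: permutation importance on the ORIGINAL factor columns, so each
# categorical predictor gets one score instead of one score per dummy level.
# Simulated data stands in for the Ames columns (assumption for illustration).
set.seed(42)
n <- 500
dat <- data.frame(
  street    = factor(sample(c("Grvl", "Pave"), n, replace = TRUE)),
  sale_type = factor(sample(c("WD", "New", "COD", "Oth"), n, replace = TRUE))
)

# The response depends strongly on sale_type and only weakly on street
effect <- c(WD = 0, New = 40, COD = -30, Oth = 10)
dat$price <- 200 +
  2 * (dat$street == "Pave") +
  unname(effect[as.character(dat$sale_type)]) +
  rnorm(n, sd = 5)

fit <- lm(price ~ street + sale_type, data = dat)

# Root mean squared error of a model on a data frame with a `price` column
rmse <- function(model, data) {
  sqrt(mean((data$price - predict(model, newdata = data))^2))
}

# Hypothetical helper: average increase in RMSE after shuffling one whole column
perm_importance <- function(model, data, predictors, n_rep = 20) {
  baseline <- rmse(model, data)
  sapply(predictors, function(p) {
    permuted_rmse <- replicate(n_rep, {
      shuffled <- data
      shuffled[[p]] <- sample(shuffled[[p]])  # permute the original factor
      rmse(model, shuffled)
    })
    mean(permuted_rmse) - baseline
  })
}

imp <- perm_importance(fit, dat, c("street", "sale_type"))
print(sort(imp, decreasing = TRUE))
```

Because the model receives the factor itself and expands it into dummies internally, shuffling the factor perturbs all of its dummy columns at once, which is what yields one score per original predictor. For a plain linear model, a term-level F-test such as `drop1(fit, test = "F")` would be another way to judge a factor as a whole.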