How can I determine variable importance (with the vip package in R) for categorical predictors when they have been one-hot encoded? It seems impossible to do this when the model is built on the dummy variables rather than on the original categorical predictor.

I will demonstrate what I mean with the Ames Housing dataset, using two categorical predictors: Street (two levels) and Sale.Type (ten levels). I converted them from characters to factors.

library(AmesHousing)
df <- data.frame(ames_raw)

# convert characters to factors 
df <- df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
            data = df,
            method = "lm")

vip(mod_lm)

[Image: variable importance plot produced by vip(mod_lm)]

The variable importance plot ranks each level separately rather than the original predictors. I can see that StreetPave is important, but I cannot see whether Street is important.

mapleleaf
  • Hi, can you provide a minimal reproducible example? – riccardo-df Jan 18 '22 at 21:43
  • @PlasticMan Added – mapleleaf Jan 19 '22 at 14:18
  • One easy approach would be to sum the importance of all factor levels to get the importance of the original variable. – missuse Jan 19 '22 at 14:21
  • I was going to suggest the same answer provided by @missuse. Variable importance is nothing but the increase in fit of the tree after each split, i.e., a number. So, just sum up the variable importance of all the dummies representing, say, `Sale.Type`, and use results for plotting. Should I post this as answer? – riccardo-df Jan 19 '22 at 14:33
  • Sorry, I meant the increase in fit associated with each variable. My brain automatically thought about decision trees. – riccardo-df Jan 19 '22 at 14:40
  • @PlasticMan Yes, please post as the answer. – mapleleaf Jan 19 '22 at 19:27

1 Answer

From the caret documentation, we see that variable importance in linear models corresponds to the absolute value of the t-statistic for each covariate. So, we can manually compute it, as I do in the code below.

lm() automatically converts categorical variables into dummies, so each level (except the base level) gets its own coefficient and its own t-statistic. To get the importance of each original covariate, we therefore have to sum the absolute t-statistics over its dummies. I did not find a way to automate this, so if you want to apply my solution to a different set of variables, you need to be careful in choosing which items of t.stats to sum.
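One possible way to make the selection less manual is to group coefficients by name prefix, since each dummy is named as the predictor name followed by the level (for example StreetPave). The following is a minimal sketch, not a tested solution for the Ames model: `group_importance` is a hypothetical helper, the example uses the built-in iris data with a plain lm() fit so that it runs without extra packages, and it assumes that no predictor name is a prefix of another predictor name.

```r
# Hypothetical helper: for each predictor, sum |t| over every coefficient
# whose name starts with that predictor's name.
# Assumption: no predictor name is a prefix of another predictor name.
group_importance <- function(t_stats, predictors) {
  # Strip backticks, which can wrap coefficient names that are not syntactic.
  coef_names <- gsub("`", "", names(t_stats), fixed = TRUE)
  sapply(predictors, function(p) {
    sum(abs(t_stats[startsWith(coef_names, p)]))
  })
}

# Stand-in model: Species is a three-level factor, Sepal.Width is numeric.
fit <- lm(Sepal.Length ~ Species + Sepal.Width, data = iris)
t_stats <- coef(summary(fit))[, "t value"]

imp <- group_importance(t_stats, c("Species", "Sepal.Width"))
print(imp)
barplot(imp)
```

For the model in this thread, the same helper could presumably be called with the t-statistics extracted from mod_lm and the predictor names c("Street", "Sale.Type"); check the coefficient names first to confirm that the prefixes match.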

Finally, we can use the results for plotting. I just used the base R barplot() function, but you can customize the plot as you want (for example, with the ggplot2 package for better visualization).

P.S. When you provide a reproducible example, remember to load all the needed packages.

P.P.S. Summing over dummies may be sensitive to the base level of the factor (i.e., the level that is omitted from the regression). I do not know whether that could be an issue.
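A quick way to probe the base-level concern is to refit the same model with a different reference level and compare the summed absolute t-statistics. The sketch below is only an illustration under stated assumptions: it uses the built-in iris data instead of the Ames data, and `sum_abs_t` is a hypothetical helper. It also prints the term-level F statistic from anova() as one possible comparison point, because that statistic tests the factor as a whole rather than each dummy against the base level.

```r
# Same one-factor model, fitted with two different reference levels.
fit_a <- lm(Sepal.Length ~ Species, data = iris)  # base level: setosa
iris_b <- transform(iris, Species = relevel(Species, ref = "virginica"))
fit_b <- lm(Sepal.Length ~ Species, data = iris_b)  # base level: virginica

# Hypothetical helper: sum of |t| over all non-intercept coefficients.
sum_abs_t <- function(fit) sum(abs(coef(summary(fit))[-1, "t value"]))

sum_a <- sum_abs_t(fit_a)
sum_b <- sum_abs_t(fit_b)
print(c(base_setosa = sum_a, base_virginica = sum_b))

# Term-level F statistic for Species under each parameterization.
f_a <- anova(fit_a)["Species", "F value"]
f_b <- anova(fit_b)["Species", "F value"]
print(c(base_setosa = f_a, base_virginica = f_b))
```

In this small example the summed |t| changes with the reference level while the F statistic for the whole term does not, so it may be worth checking how much the ranking of your own predictors moves when you relevel a factor.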

library(AmesHousing)
library(caret)
library(dplyr)

df = data.frame(ames_raw)

# convert characters to factors
df = df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
                data = df,
                method = "lm")

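# Manually computing variable importance from the absolute t-statistics of the model.
t.stats = abs(coef(summary(mod_lm))[, "t value"])
# Element 1 is the intercept and element 2 is StreetPave; the rest are the Sale.Type dummies.
imp.sale = sum(t.stats[-(1:2)])
imp.street = t.stats[2]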

# Plotting.
barplot(c(imp.sale, imp.street), names.arg = c("Sale", "Street"))
riccardo-df