
I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:

Error in auc3_(actual, predicted, ranks) : 
  Not compatible with requested type: [type=character; target=double].

The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):

library(caret)

#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))


# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data, 
                          method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)                       

The issue persists when using different models (like randomForest and SVM).

If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.

Thanks!

hanibal

2 Answers


When you call varImp on an lvq model, it falls back to filterVarImp() because there is no model-specific variable importance for lvq. Now if you check the help page:

For two class problems, a series of cutoffs is applied to the predictor data to predict the class. The sensitivity and specificity are computed for each cutoff and the ROC curve is computed.
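To see what that means in practice, here is a minimal sketch (with made-up data) showing filterVarImp() working on a purely numeric predictor, where the cutoffs described in the help page can be applied:

```r
library(caret)

# Made-up two-class data with one numeric predictor
set.seed(1)
y <- factor(rep(c("a", "b"), each = 20))
x <- data.frame(pred1 = rnorm(40, mean = rep(c(-0.5, 0.5), each = 20)))

# Works, because cutoffs can be applied to the numeric pred1:
# the result is the ROC-based importance (AUC) per class.
imp <- filterVarImp(x, y)
imp
```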

Now if you read the source code of varImp.train(), the data it feeds into filterVarImp() is the original data frame, not whatever comes out of the preProcess step.

This means that if the original data contains a factor variable, no cutoffs can be applied to it, and it throws an error like this:

filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) : 
  Not compatible with requested type: [type=character; target=double].

So, using my example, and as you have pointed out, you need to one-hot encode it:

set.seed(111)
dummy_data = data.frame(Label = rep(c("a","b"),each=20))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)

control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)

ohe_data = data.frame(
            Label = dummy_data$Label,
            model.matrix(Label ~ 0+.,data=dummy_data))

model.lvq <- caret::train(Label~., data=ohe_data, 
                          method="lvq", preProcess="scale",
                       trControl=control.lvq)

caret::varImp(model.lvq, scale=FALSE)  

ROC curve variable importance

       Importance
pred1      0.6575
pred20     0.6000
pred21     0.6000

If you use a model that doesn't have a specific variable-importance method, one option is to calculate the variable importance first, and fit the model after that.
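For example, you can call filterVarImp() directly on the one-hot-encoded predictors before training anything (a sketch reusing the dummy data from above):

```r
library(caret)

# Same dummy data as in the answer above
set.seed(111)
dummy_data <- data.frame(Label = factor(rep(c("a", "b"), each = 20)))
dummy_data$pred1 <- rnorm(40, rep(c(-0.5, 0.5), each = 20), 2)
dummy_data$pred2 <- factor(rbinom(40, 1, rep(c(0.3, 0.7), each = 20)))

# One-hot encode so that every column fed to filterVarImp() is numeric
X <- model.matrix(Label ~ 0 + ., data = dummy_data)

# ROC-based importance, one row per encoded column
imp <- filterVarImp(as.data.frame(X), dummy_data$Label)
imp
```

After inspecting imp, you can fit whatever model you like on the encoded data.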

StupidWolf
  • Thanks, @StupidWolf, for pointing out the problem with the predictors. I ran your example. However, I am still getting an error message: ``` Error in y - mean(y, rm.na = TRUE) : non-numeric argument to binary operator In addition: Warning message: In mean.default(y, rm.na = TRUE) : argument is not numeric or logical: returning NA ``` Since it appears to work for you, could it be a version problem? I am running on R version 4.0.3 and caret is on version 6.0-86. After setting the Label and pred2 to factors, I get the same error as described in my original post. – hanibal Mar 01 '21 at 13:01
  • Nothing to do with R versions.. check what you are doing next time @hanibal – StupidWolf Mar 01 '21 at 14:16
  • Ok. I was hoping caret was able to handle categorical variables internally and compute a single variable importance for categorical features. Thanks for clarifying that this does not seem to be the case. I elaborated that in the answer I posted -- maybe it wasn't too clear since it got pushed down ;-). Thanks for the link. – hanibal Mar 01 '21 at 14:23
  • I wasn't aware that preProcess was also applied to categorical data. Actually, the issue appears to persist for me after removing the preProcess argument in caret::train (at least without encoding the categorical data first). – hanibal Mar 01 '21 at 14:33
  • Yup you are right.. Ok I have to read the code in detail. The AUC function used by caret::varImp cannot handle the factor – StupidWolf Mar 01 '21 at 14:41
  • Commenting to say I am also facing this same problem. While reading "Applied Predictive Modeling" (by Max Kuhn who also wrote filterVarImp), I decided to try out filterVarImp to get some quick variable importance scores during an EDA. I have been getting the same error as above ("Error in auc3_(actual, predicted, ranks) :....etc"). To be clear, is the only solution for this to one-hot encode all categorical features? – Braden Anderson Nov 24 '21 at 18:30
  • right now yes, if you want to go with caret that is – StupidWolf Nov 24 '21 at 19:49

Note that this problem can be circumvented by replacing each factor feature (with d levels) with its (d-1)-dimensional indicator encoding:

model.matrix(~ dummy_data$pred2 - 1)[, 1:(length(levels(dummy_data$pred2)) - 1)]

However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.
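As a sketch of an alternative, caret itself ships a dummyVars() helper that produces the same kind of encoding; with fullRank = TRUE it drops one level per factor, giving the (d-1)-column indicator encoding (made-up data for illustration):

```r
library(caret)

# A single two-level factor, as in the examples above
dummy_data <- data.frame(pred2 = factor(rep(c(0, 1), 20)))

# fullRank = TRUE drops one reference level per factor
enc <- dummyVars(~ pred2, data = dummy_data, fullRank = TRUE)

# A single indicator column for the non-reference level
predict(enc, newdata = dummy_data)
```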

hanibal