I want to perform a multi-class classification with the caret package. Below is a minimal example.

library(caret)   
library(randomForest)


x <- data.frame("A"=seq(1,100), "B"=seq(1,100), "C"="class1")
x[,"C"] <- as.character(x[,"C"])
x[1,"C"] <- "class2"
x[2,"C"] <- "class3"
x[3,"C"] <- "class4"
x[4,"C"] <- "class5"
x[5,"C"] <- "class6"
x[6,"C"] <- "class7"
x[7,"C"] <- "class8"
x[8,"C"] <- "class9"
x[9,"C"] <- "class10"
x[10,"C"] <- "class11"
x[11,"C"] <- "class12"
x[,"C"] <- as.factor(x[,"C"])

control <- trainControl(method="repeatedcv", number=10, repeats=1, search="grid")
set.seed(5)
metric <- "Accuracy"  # caret's default metric for classification
tunegrid <- expand.grid(.mtry=c(1:2))
fit <- train(x=x[,1:2], y=x$C, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(fit)  
plot(fit)

When running the code, I get an error stating:

1: model fit failed for Fold2.Rep1: mtry=1 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
  Can't have empty classes in y.

Related posts suggest that this happens because resampling does not account for the factor levels of the response variable: a training fold can end up with no observations for some classes while the factor still lists them as levels. One typically runs into this problem when there are many classes to predict and only a few observations per class.
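To illustrate the mechanism described above, here is a minimal base R sketch. It assumes a toy response shaped like the one in the example (one common class plus eleven classes with a single row each) and a hand-picked held-out set standing in for one cross-validation fold; it does not reproduce caret's actual fold assignment.

```r
# Toy response: eleven singleton classes followed by 89 rows of "class1"
y <- factor(c(paste0("class", 2:12), rep("class1", 89)))

# Hypothetical held-out rows, standing in for one fold of a 10-fold split
held_out <- 1:10
y_train <- y[-held_out]

# Subsetting a factor keeps every level, including those with no rows left
n_levels_kept <- nlevels(y_train)

# Levels that are now empty in the training subset
empty_levels <- names(which(table(y_train) == 0))

# droplevels() removes the unused levels from the factor
n_levels_dropped <- nlevels(droplevels(y_train))

print(n_levels_kept)
print(empty_levels)
print(n_levels_dropped)
```

Passing a factor like `y_train` to `randomForest()` is what triggers the "Can't have empty classes in y" message, since the empty levels are still part of the factor.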

Is there a workaround that makes caret remove the missing factor levels within its resampling methods (e.g., via droplevels())?

jmuhlenkamp
BjoSch
  • A solution is to use stratified sampling. It won't work in your synthetic example, since you have several classes with just one row. [Here](https://stackoverflow.com/questions/35907477/caret-package-stratified-cross-validation-in-train-function) is a possible solution. – missuse Nov 13 '17 at 19:01
  • Agreed, but that of course is "another" sampling method. Any idea if this is a known "issue" or intended behavior? – BjoSch Nov 13 '17 at 19:34
  • I trust this is intended behavior, since you should not be able to fit a model that is missing some classes during cross-validation. – missuse Nov 13 '17 at 20:54
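Since stratified folds cannot help with classes that have only one row, one possible workaround is to build the resampling indices by hand and pass them through the `index` argument of `trainControl()`. The sketch below is only an illustration under stated assumptions: rows from rare classes are always kept in the training set, and only rows from common classes are rotated through the held-out folds. The threshold `min_n` and the helper names (`rare_rows`, `common_rows`, `fold_id`, `train_index`) are hypothetical, and the folds are assigned at random rather than stratified.

```r
library(caret)
library(randomForest)

# Same toy data as in the question, built more compactly
x <- data.frame(A = 1:100, B = 1:100, C = "class1", stringsAsFactors = FALSE)
x$C[1:11] <- paste0("class", 2:12)
x$C <- factor(x$C)

set.seed(5)
k <- 10
min_n <- k  # hypothetical threshold: classes with fewer rows are never held out

# Number of rows in the class of each observation
n_per_row <- as.vector(table(x$C)[as.character(x$C)])
rare_rows <- which(n_per_row < min_n)
common_rows <- which(n_per_row >= min_n)

# Randomly assign the common rows to k folds
fold_id <- sample(rep(seq_len(k), length.out = length(common_rows)))

# Training rows per fold: all rare rows plus the common rows outside the fold
train_index <- lapply(seq_len(k), function(i) {
  sort(c(rare_rows, common_rows[fold_id != i]))
})
names(train_index) <- sprintf("Fold%02d", seq_len(k))

# Check that every class has at least one row in every training set
all_classes_present <- all(vapply(
  train_index,
  function(idx) all(table(x$C[idx]) > 0),
  logical(1)
))

control <- trainControl(method = "cv", index = train_index)
fit <- train(x = x[, 1:2], y = x$C, method = "rf", metric = "Accuracy",
             tuneGrid = expand.grid(mtry = 1:2), trControl = control)
print(fit)
```

The trade-off is that the rare classes are never part of a held-out set, so the resampled accuracy only reflects performance on the common classes.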

0 Answers