I have data with class imbalance: the response variable has two classes, and one of them is much more common than the other. Accuracy does not seem to be a good metric for training a model in this situation, since I can get 99% accuracy while completely misclassifying the minority class. I think using the F1 score would be more appropriate.
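To make the concern concrete, here is a minimal base-R sketch with made-up labels. The 990/10 split and the "always predict the majority class" classifier are assumptions for illustration only, not my real data or model:

```r
# Hypothetical labels: 990 majority ("b") and 10 minority ("a") observations.
obs  <- factor(c(rep("b", 990), rep("a", 10)), levels = c("a", "b"))

# A degenerate classifier that always predicts the majority class.
pred <- factor(rep("b", 1000), levels = c("a", "b"))

# Accuracy looks excellent even though no minority case is found.
accuracy <- mean(pred == obs)

# Confusion counts with the minority class "a" treated as positive.
tp <- sum(pred == "a" & obs == "a")
fp <- sum(pred == "a" & obs == "b")
fn <- sum(pred == "b" & obs == "a")

# F1 written as 2TP / (2TP + FP + FN), which stays defined when TP = 0.
f1_minority <- 2 * tp / (2 * tp + fp + fn)

print(c(accuracy = accuracy, F1 = f1_minority))
```

With these assumed labels, accuracy is high while the F1 score for the minority class is zero, which is the behaviour I want the training metric to penalise.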
Has anyone tried using the F1 score as the training metric in R? I modified the iris data set to make Species a binary variable and tried to fit a random forest. Could someone please help me debug this?
library(caret)
library(randomForest)
data(iris)
iris$Species = ifelse(iris$Species == "setosa", "a", "b")
iris$Species = as.factor(iris$Species)
f1 <- function (data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, postive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = f1,
                              search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
random.forest.orig <- train(Species ~ ., data = iris,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "F1",
                            trControl = train.control)
This gives the following error:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :10
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
5: stop("Stopping", call. = FALSE)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
1: train(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
invalid mtry: reset to within valid range