
I have data with class imbalance: the response variable has two classes, and one is significantly more common than the other. Accuracy does not seem to be a good metric for training a model in this situation, since I can get 99% accuracy while completely misclassifying the minority class. I think using the F1 score would be more appropriate.

Has anyone ever tried using the F1 score as a training metric in R? I tried modifying the iris data set to make Species a binary variable and running a random forest on it. Could someone please help me debug this?

library(caret)
library(randomForest) 

data(iris)

iris$Species = ifelse(iris$Species == "setosa", "a", "b") 

iris$Species = as.factor(iris$Species) 

f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, postive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}


train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = f1,
                              search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1)) 

random.forest.orig <- train(Species ~ ., data = iris,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "F1",
                            trControl = train.control)

Running this gives the following error:

Something is wrong; all the F1 metric values are missing:
       F1     
 Min.   : NA  
 1st Qu.: NA  
 Median : NA  
 Mean   :NaN  
 3rd Qu.: NA  
 Max.   : NA  
 NA's   :10   
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
5: stop("Stopping", call. = FALSE)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid, 
       metric = "F1", trControl = train.control)
1: train(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid, 
       metric = "F1", trControl = train.control)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
  invalid mtry: reset to within valid range

Source: Training Model in Caret Using F1 Metric (https://stackoverflow.com/questions/47820750/training-model-in-caret-using-f1-metric)

  • You are looking at this answer: https://stackoverflow.com/questions/47820750/training-model-in-caret-using-f1-metric. However, "pass" is a factor level for the outcome variable there; your outcome has levels "a" and "b". Perhaps try to make it more general using the lev argument; check this answer: https://stackoverflow.com/questions/53269560/error-when-trying-to-pass-custom-metric-in-caret-package – missuse Sep 27 '20 at 06:03
  • @missuse thank you for your reply! I'm not sure I follow. I tried having the levels as 0 and 1, but it didn't work. Thank you! – stats_noob Sep 27 '20 at 06:08
  • Change `positive = "pass"` to `positive = lev[2]` or `lev[1]`, depending on which is the positive class (see the sketch after this thread). Your outcome does not have a "pass" level; it is specific to the answer you linked. – missuse Sep 27 '20 at 06:09
  • There is also a function called `prSummary` that you can use to compute that metric. – missuse Sep 27 '20 at 06:10
  • @missuse thank you! I just tried `positive = lev[2]`. It seems to be working so far; my data is pretty big. Is there any difference if I had put lev[1]? (I don't think so.) How could I use prSummary along with caret train? I have used prSummary in isolation before. I saw this here: https://stackoverflow.com/questions/39783588/prsummary-in-r-caret-package-for-imbalance-data. I am not sure if this applies to what I am doing? – stats_noob Sep 27 '20 at 06:17
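
Following missuse's comments above, here is a minimal sketch of a corrected summary function, not a definitive fix. It uses the lev argument instead of the hard-coded "pass" level (which is not a level of this outcome), and it assumes the second factor level ("b" here) is the positive class; swap in lev[1] if the first level is the class of interest. It also keeps mtry within the number of predictors, since iris has only four and the grid above goes to 10, which is what triggers the "invalid mtry" warnings.

f1 <- function(data, lev = NULL, model = NULL) {
  # lev holds the outcome's factor levels, so no level name is hard-coded
  precision <- posPredValue(data$pred, data$obs, positive = lev[2])
  recall <- sensitivity(data$pred, data$obs, positive = lev[2])
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

# iris has only 4 predictors, so mtry values above 4 are invalid for rf
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 4, by = 1))

Alternatively, as missuse notes, caret has a built-in prSummary that computes precision/recall-based measures; if it is used as the summaryFunction (with classProbs = TRUE, as above), the F measure it returns is named "F", so metric = "F" would be the matching argument to train.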
