
I am using ranger to model a multi-class classification problem on a large mixed-type data frame (where some categorical variables have more than 53 levels). Training and testing run without any problem. However, building the confusion matrix / contingency table fails.

To illustrate the difficulties, I am using the iris data instead, with Species as the classification variable:

library(ranger)
library(caret)

# Data
idx = sample(nrow(iris),100)
data = iris

# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)

# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

The following difficulties are encountered:

table(Test_Set$Species, probabilitiesSpecies$predictions)

Error in table(Test_Set$Species, probabilitiesSpecies$predictions) : 
all arguments must have the same length

or

caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.

A binary classification, shown below, works, however:

idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))

Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)

# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))

How can this issue be tackled for multi-class classification to get the confusion matrix? I have also posted this as a separate thread (Error while computing confusion matrix for multiclassification using ranger).

Ray

1 Answer


The ranger documentation says the following about probability = TRUE:

With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).

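I.e., when set to TRUE, you get class probability estimates, which you then have to turn into class labels yourself according to your own decision rule or threshold values. When set to FALSE, ranger grows a classification forest and predict() returns the predicted class directly (by majority vote over the trees, as far as I know).

In any case, your approach should be the following: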
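If you want to keep probability = TRUE, one simple decision rule is to pick the class with the highest estimated probability for each sample. The sketch below uses a small made-up probability matrix (the values and the `truth` labels are illustrative assumptions, not ranger output) shaped like `predict(...)$predictions`. The point is to map the column index from max.col() back to the class names, with the same factor levels as the reference. Using max.col() - 1 only lines up in the binary example because the levels there happen to be "0" and "1".

```r
# Hypothetical probability matrix shaped like predict(...)$predictions
# from a probability forest: one row per sample, one column per class.
prob <- matrix(
  c(0.90, 0.07, 0.03,
    0.10, 0.60, 0.30,
    0.05, 0.25, 0.70,
    0.20, 0.45, 0.35),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Made-up true labels for the same four samples.
truth <- factor(c("setosa", "versicolor", "virginica", "virginica"),
                levels = colnames(prob))

# Pick the most probable class per row and map the column index back to
# the class name, keeping the same factor levels as the reference.
pred <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
               levels = levels(truth))

# Both arguments now have one entry per sample and identical levels.
cm <- table(Predicted = pred, Actual = truth)
print(cm)
```

With a fitted probability forest, `prob` would be `predict(Species.ranger, data = Test_Set)$predictions`, assuming its columns are named after the class levels. The resulting table (or `pred` and `truth` directly) can then be passed to caret::confusionMatrix().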

Species.ranger <- ranger(
        Species ~ .,
        data = Train_Set,
        importance ="impurity",
        save.memory = TRUE, 
        probability = FALSE
)

The predictions can then be evaluated with confusionMatrix in the following way:

probabilitiesSpecies <- predict(
        Species.ranger,
        data = Test_Set,
        verbose = TRUE
        )

confusionMatrix(
        table(
                probabilitiesSpecies$predictions,
                Test_Set$Species
        )
)

Output

Confusion Matrix and Statistics

            
             setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         16         1
  virginica       0          0        16

Overall Statistics
                                          
               Accuracy : 0.98            
                 95% CI : (0.8935, 0.9995)
    No Information Rate : 0.34            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.97            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            1.0000           0.9412
Specificity                   1.00            0.9706           1.0000
Pos Pred Value                1.00            0.9412           1.0000
Neg Pred Value                1.00            1.0000           0.9706
Prevalence                    0.34            0.3200           0.3400
Detection Rate                0.34            0.3200           0.3200
Detection Prevalence          0.34            0.3400           0.3200
Balanced Accuracy             1.00            0.9853           0.9706
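To make the link between the table and the statistics explicit, here is a small base R sketch that re-enters the confusion table printed above by hand and recomputes a few of the reported numbers. It assumes rows are predictions and columns are the reference, which is how the table() call above was built.

```r
# Confusion table copied from the output above
# (rows = predicted class, columns = actual class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(17,  0,  0,
     0, 16,  1,
     0,  0, 16),
  nrow = 3, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

# Overall accuracy: correctly classified samples over all samples.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class sensitivity: correct predictions over the actual class count.
sensitivity <- diag(cm) / colSums(cm)

# Per-class positive predictive value: correct predictions over the
# number of times the class was predicted.
ppv <- diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(ppv, 4))
```

These reproduce the Accuracy of 0.98 and the Sensitivity and Pos Pred Value rows in the output: the single off-diagonal entry is one virginica sample predicted as versicolor.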
Serkan
  • Thanks for the workaround. I am still wondering how the probability = TRUE case can be summarized in a confusion matrix, just as you did above with probability = FALSE. – Ray Aug 06 '21 at 07:58
  • The confusion.matrix appears as integer[3x3](S3:table) as part of Species.ranger when probability=FALSE. Also, predictions are of type factor in both Species.ranger and probabilitiesSpecies, and forest is of type list[9](S3:ranger.forest). In contrast, when probability=TRUE, predictions are of type double in both Species.ranger and probabilitiesSpecies, and forest is of type list[10](S3:forest), with terminal.class.counts appearing additionally. Also, confusion.matrix does not appear under Species.ranger. – Ray Aug 06 '21 at 08:42
  • I'm not sure that this is a "workaround", as it is similar to the example given in the documentation! To get classes with probability = TRUE, I'd assume you'd have to specify the thresholds yourself! I don't see any other way around it, but maybe you'd benefit from reading the paper. – Serkan Aug 06 '21 at 11:06