I'm fitting two models with the ranger package using the same seed. The first one predicts the class directly and the second one returns the matrix of class probabilities. My goal is to get the same predictions from both, but they differ in 4 rows. Does anyone know a solution? For the second model I assign each row to the class with the maximum probability. What should the cut point be?

library(ranger)
library(caret)

## fit model 1
mod <- ranger(formula = Species ~., data = iris, seed = 2020)
res1 <- predict(object = mod, data = iris[,-5])$predictions

## fit model 2
mod2 <- ranger(formula = Species ~., data = iris, probability = TRUE, seed = 2020)
## predict once, then pick the class with the highest probability in each row
probs <- predict(object = mod2, data = iris[,-5])$predictions
res2 <- factor(colnames(probs)[apply(probs, 1, which.max)],
       levels = c("setosa","versicolor","virginica"))

head(data.frame(res1, res2))
    res1   res2
1 setosa setosa
2 setosa setosa
3 setosa setosa
4 setosa setosa
5 setosa setosa
6 setosa setosa

all.equal(res1, res2)
[1] "4 string mismatches"

My expected output

all.equal(res1, res2)
[1] TRUE
Rafael Díaz
    It looks like when you use `probability=TRUE` it uses a completely different method (Malley et al. 2012) compared to the classification tree (Breiman 2001). Setting the seed won't help since those different methods use random numbers differently. The seed will only be reproducible when using the same method. It's more like you're calling two completely different modeling functions. Perhaps you could clarify your analysis goal and maybe ask for some statistical help at Cross Validated instead. – MrFlick Jul 09 '20 at 02:20

2 Answers

Very interesting question: I am a user of ranger and was not aware of this result.

As stated by @MrFlick in the comment on your question, you are using two different methods. You can confirm this by checking the treetype element of mod and mod2:

mod$treetype
"Classification"

mod2$treetype
"Probability estimation"
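
To see where the two forests disagree, something along the following lines may help. It is only a sketch: it refits the two models from the question, assumes the probability matrix carries the class names as column names, and uses `pred_class`, `pred_prob`, and `diff_rows` as names introduced here for illustration.

```r
library(ranger)

# Refit both forests as in the question
mod  <- ranger(Species ~ ., data = iris, seed = 2020)
mod2 <- ranger(Species ~ ., data = iris, probability = TRUE, seed = 2020)

# Hard class predictions from the classification forest
pred_class <- predict(mod, data = iris[, -5])$predictions

# Class with the highest averaged probability from the probability forest
probs <- predict(mod2, data = iris[, -5])$predictions
pred_prob <- factor(colnames(probs)[apply(probs, 1, which.max)],
                    levels = levels(iris$Species))

# Rows where the two forests disagree, with their estimated probabilities
diff_rows <- which(as.character(pred_class) != as.character(pred_prob))
data.frame(row = diff_rows,
           classification = pred_class[diff_rows],
           probability    = pred_prob[diff_rows],
           round(probs[diff_rows, , drop = FALSE], 3))
```

The rows that differ are likely to be ones where the top two probabilities are close, which is where the two aggregation rules are most likely to diverge.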
carlo_sguera

There is no cut point that guarantees identical results for your two models. A probability forest is not the same as a classification forest that returns the average of the hard votes from each tree. Rather, each tree returns a continuous probability estimate, and those continuous estimates are averaged to get the ensemble probability prediction. See the ranger documentation:

Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate.

The difference is that in a classification forest each tree contributes a single hard vote for one class, while in a probability forest each tree contributes a vector of continuous class probabilities.
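
A toy illustration of why the two aggregation rules can pick different classes, using made-up numbers rather than anything produced by ranger. It also simplifies by assuming each tree's hard vote is the thresholded version of that same tree's probability, which is not how the two ranger forests are actually related (they grow different trees):

```r
# Hypothetical per-tree estimates of P(class = "A") for one observation,
# from a toy forest of three trees (made-up numbers for illustration)
p_A <- c(0.55, 0.55, 0.10)

# Classification-style aggregation: each tree casts one hard vote
votes <- ifelse(p_A > 0.5, "A", "B")
vote_winner <- names(which.max(table(votes)))

# Probability-style aggregation: average the per-tree estimates first
mean_p_A <- mean(p_A)
prob_winner <- ifelse(mean_p_A > 0.5, "A", "B")

vote_winner  # "A": two of the three trees vote for A
prob_winner  # "B": the averaged probability of A is 0.4
```

Two weakly confident trees outvote one strongly confident tree under majority voting, but not under averaging, so no threshold on the averaged probabilities can recover the majority vote in general.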

lead-y