I am running into this issue while working on my real data. Here is a reproducible example with some simulated data:

library(caret)

# Simulated data: a two-class outcome perfectly separated by a two-level factor
dummy <- cbind.data.frame(y = factor(rep(c("yes", "no"), each = 50)), 
    x = factor(rep(c("A", "B"), each = 50))
)

# 3-fold CV; class probabilities are needed so ROC can be used as the metric
set.seed(1)  # make the random fold assignment reproducible
dummyFolds <- caret::createFolds(dummy$y, 3, returnTrain = TRUE)
dummyTc <- caret::trainControl(index = dummyFolds, method = "cv", number = 3, 
    summaryFunction = caret::twoClassSummary, classProbs = TRUE
)

dummyModel <- caret::train(y ~ x, data = dummy, method = "rpart", metric = "ROC",
    trControl = dummyTc, tuneLength = 5
)
dummyModel$finalModel

The last line prints:

n= 100 

node), split, n, loss, yval, (yprob)
    * denotes terminal node

1) root 100 50 no (0.5000000 0.5000000)  
2) xB>=0.5 50  0 no (1.0000000 0.0000000) *
3) xB< 0.5 50  0 yes (0.0000000 1.0000000) *

Now directly with rpart:

library(rpart)

# Fit the same tree directly with rpart, keeping x as a factor
dummyModel2 <- rpart::rpart(y ~ x, data = dummy)
dummyModel2

The last line prints:

n= 100 

node), split, n, loss, yval, (yprob)
    * denotes terminal node

1) root 100 50 no (0.5000000 0.5000000)  
2) x=B 50  0 no (1.0000000 0.0000000) *
3) x=A 50  0 yes (0.0000000 1.0000000) *

As you can see, when rpart is used through caret, the predictor at some point gets interpreted as a numeric variable (the node rule is shown as xB >= 0.5 or xB < 0.5). With rpart directly we get the expected formatting of the rule (x = some factor level). Why?
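
To illustrate where the xB name might come from, here is a quick check on my side (an illustration only, not a claim about what caret does internally): building the design matrix for the same formula by hand produces a numeric 0/1 dummy column named xB, which is exactly the kind of variable that would be split at 0.5.

# Illustrative check: R's formula machinery expands the two-level factor x
# into a single numeric 0/1 dummy column named "xB"
head(model.matrix(y ~ x, data = dummy))

# The fitted rpart object stores the variable name used at each split;
# for the caret fit this shows "xB" rather than "x"
dummyModel$finalModel$frame$var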
