I ran across this issue while working on my real data. Here is a reproducible example with some simulated data:
library(caret)
# y and x are perfectly associated: "A" goes with "yes", "B" with "no"
dummy <- cbind.data.frame(y = factor(rep(c("yes", "no"), each = 50)),
                          x = factor(rep(c("A", "B"), each = 50)))
dummyFolds <- caret::createFolds(dummy$y, 3, returnTrain = TRUE)
dummyTc <- caret::trainControl(index = dummyFolds, method = "cv", number = 3,
                               summaryFunction = caret::twoClassSummary,
                               classProbs = TRUE)
dummyModel <- caret::train(y ~ x, data = dummy, method = "rpart", metric = "ROC",
                           trControl = dummyTc, tuneLength = 5)
dummyModel$finalModel
The last line prints:
n= 100

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 50 no (0.5000000 0.5000000)
  2) xB>=0.5 50 0 no (1.0000000 0.0000000) *
  3) xB< 0.5 50 0 yes (0.0000000 1.0000000) *
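Note that the split labels refer to a column called xB rather than the factor x. A quick look at the variables stored in the fitted rpart object (just an inspection of the object created above) shows the same name:

# frame$var holds the splitting variable for each internal node ("<leaf>" for leaves);
# for the caret fit it reports the dummy-coded name "xB" instead of "x"
unique(as.character(dummyModel$finalModel$frame$var))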
Now directly with rpart:
library(rpart)
dummyModel2 <- rpart::rpart(y ~ x, data = dummy)
dummyModel2
The last line prints:
n= 100

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 50 no (0.5000000 0.5000000)
  2) x=B 50 0 no (1.0000000 0.0000000) *
  3) x=A 50 0 yes (0.0000000 1.0000000) *
As you can see, when rpart is used through caret, the predictor at some point gets treated as a numeric variable (the node rules read xB >= 0.5 and xB < 0.5). Directly with rpart the rules are formatted as expected (x = some factor level). Why?
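For reference, expanding the same formula with base R's model.matrix produces a numeric 0/1 column named xB, which matches the split labels in the caret-fitted tree above. This is only meant to illustrate where such a column could come from, not a claim about what caret does internally:

# model.matrix() dummy-codes the factor x into a single numeric 0/1 column "xB"
mm <- model.matrix(y ~ x, data = dummy)
head(mm)          # columns: (Intercept), xB
str(mm[, "xB"])   # a numeric 0/1 vector, not a factor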