
I'm using the caret package to fit a tree model. As I understand it, caret uses cross-validation (CV) to find the optimal tuning parameter for pruning the tree.

This is the code I use:

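library(caret)

# Random 2/3 split into a learning set and a test set
id2 <- sample(1:nrow(data), 2/3 * nrow(data))
# learn
app <- data[id2, ]
# test
test <- data[-id2, ]

# 8-fold CV, choosing the tuning parameter by area under the ROC curve
ctrl <- trainControl(method = "cv", number = 8, classProbs = TRUE, summaryFunction = twoClassSummary)
mod0 <- train(class ~ ., data = app, method = "rpart", trControl = ctrl, metric = "ROC")
plot(mod0)
plot(mod0$finalModel, uniform = TRUE, margin = .1)
text(mod0$finalModel, cex = 0.8)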

Here is my data: https://drive.google.com/open?id=1xrCXTLqKvGiGeo2X0Y1DvoSKvzbYFnyccLimceDIbZg

But every time I run the code I get trees of different complexities (because of CV?), and the tree does not look pruned: it is very complex and has a lot of terminal nodes.

How can I get a less complex tree?

Lars Kotthoff
Charlotte

1 Answer


You need to set the seed prior to calling train to get reproducible results. Also, if you are running in parallel, set the seeds option in trainControl.
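A minimal sketch of both suggestions follows. Since the data from the question is not available here, it uses a two-class subset of the built-in iris data as a stand-in (an assumption), and the helper name fit_once and the seed values are purely illustrative. The seeds list has one integer vector per resample (at least as long as the number of models evaluated) plus a single integer for the final fit.

```r
library(caret)

# Stand-in two-class data (assumption: replaces the question's data set)
dat <- droplevels(subset(iris, Species != "setosa"))

n_folds <- 8
n_cp <- 3  # number of candidate cp values (caret's default tuneLength)

# Seeds used inside each resample: one vector per fold,
# plus one integer for the final model fit
set.seed(123)
seeds <- c(replicate(n_folds, sample.int(10000, n_cp), simplify = FALSE),
           list(sample.int(10000, 1)))

ctrl <- trainControl(method = "cv", number = n_folds, classProbs = TRUE,
                     summaryFunction = twoClassSummary, seeds = seeds)

# Hypothetical helper: fixing the seed right before train() fixes the CV folds
fit_once <- function() {
  set.seed(42)
  train(Species ~ ., data = dat, method = "rpart",
        trControl = ctrl, metric = "ROC")
}

fit1 <- fit_once()
fit2 <- fit_once()

# Compare the resampling results and the selected cp of the two runs
all.equal(fit1$results, fit2$results)
c(fit1$bestTune$cp, fit2$bestTune$cp)
```

Without the set.seed call, the fold assignment changes from run to run, which is one reason the selected tree can differ between runs.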

As for "complex trees"... that is pretty subjective. Why do you expect them to be simpler?

One difference between the results of train and rpart is that the latter uses the "one SE" method for pruning, while train prunes to the complexity parameter (cp) value with the best resampled performance. You can use a "one SE" method with train too (see the package website), but I've always found that it tends to be conservative (which was the original point).
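As a rough sketch of the "one SE" option, caret's trainControl accepts selectionFunction = "oneSE", which picks the simplest candidate whose performance is within one standard error of the best one. The example again uses a two-class subset of iris as a stand-in for the question's data (an assumption), and make_ctrl is a hypothetical helper name.

```r
library(caret)

# Stand-in two-class data (assumption: replaces the question's data set)
dat <- droplevels(subset(iris, Species != "setosa"))

# Hypothetical helper: same resampling setup, different selection rule
make_ctrl <- function(selection) {
  trainControl(method = "cv", number = 8, classProbs = TRUE,
               summaryFunction = twoClassSummary,
               selectionFunction = selection)
}

# Default rule: choose the cp with the best resampled ROC
set.seed(42)
fit_best <- train(Species ~ ., data = dat, method = "rpart", tuneLength = 10,
                  trControl = make_ctrl("best"), metric = "ROC")

# "One SE" rule: choose the simplest tree within one SE of the best ROC
set.seed(42)
fit_1se <- train(Species ~ ., data = dat, method = "rpart", tuneLength = 10,
                 trControl = make_ctrl("oneSE"), metric = "ROC")

# A larger cp means heavier pruning, i.e. a smaller tree
fit_best$bestTune$cp
fit_1se$bestTune$cp
```

Using the same seed before both calls keeps the folds identical, so any difference in the chosen cp comes from the selection rule alone. Passing an explicit tuneGrid with larger cp values is another way to force smaller trees, at the possible cost of performance.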

Max

topepo