
This post suggests that caret's rpart is more accurate than plain rpart because of bootstrapping and cross-validation:

Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?

However, when I compare both methods, I get an accuracy of 0.4879 for caret's rpart and 0.7347 for rpart (my code is copied below).

Besides that, the classification tree from caret's rpart has only a few nodes (splits) compared to the one from rpart.

Does anyone understand these differences?

Thank you!

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading libraries and the data

This is an R Markdown document. First we load the libraries and the data, and split the training data into a training set and a test set.

```{r section1, echo=TRUE}

# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)

# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the datasets
training <- read.csv(url(wwwTrain))
testing  <- read.csv(url(wwwTest))

# set seed for reproducibility (must be set before the random partition)
set.seed(12345)

# create a partition of the training dataset
inTrain  <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet  <- training[-inTrain, ]
dim(TrainSet)

```
## Cleaning the data

```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)

# remove variables that are mostly NA
AllNA    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet  <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)

# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)


```

## Prediction modelling

First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}

mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)

mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)

```

Second we build a similar model using rpart:
```{r section7, echo=TRUE}

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree

```

## Answer

A simple explanation is that you did not tune either model, and at the default settings rpart performed better by chance.

When you use the same parameters, you should expect the same performance.
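
To see where the difference comes from, you can compare the defaults on both sides. This is a minimal sketch that inspects the untuned `mod_rpart` object fitted in the question above:

```r
# rpart's default complexity parameter
rpart::rpart.control()$cp       # 0.01

# caret's rpart method tunes cp itself; with no tuneGrid/tuneLength supplied,
# train() evaluates only 3 candidate cp values (taken from an initial fit),
# using bootstrap resampling by default
caret::modelLookup("rpart")     # cp is the only tuning parameter
mod_rpart$results               # resampled accuracy for each cp value tried
mod_rpart$bestTune              # the cp value caret selected
```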

Let's do some tuning with caret:

```r
set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50, 
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
```

```
#output
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628          
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7009          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
```

That is a bit better than rpart with the default settings (cp = 0.01).

How about if we refit rpart with the optimal cp chosen by caret:

```r
modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       # bestTune is a one-row data frame, so take its cp column
                       control = rpart.control(cp = mod_rpart$bestTune$cp))

predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
confusionMatrix(predictDecTree, TestSet$classe)
```

```
#part of output
Accuracy : 0.7628
```

- Thank you for your explanation. I am just a beginner in ML with no experience in tuning. I understand you can make both outcomes similar by using the same parameters. Do I understand correctly that the original outcome for caret was lower due to randomness (although it uses cross-validation, whereas rpart does not use cross-validation in my example)? – user2165379 Apr 15 '18 at 12:16
- @user2165379 - it's not "randomness" per se, but the fact that the default settings for `rpart` parameters in `caret::train()` are different than the default settings in the `rpart` package that caused the original difference you saw in the results. Note that you'll need to improve your model beyond a .76 accuracy in order to get 20 out of 20 on the quiz for this project, as I describe in [Predicting Test Results based on Model Accuracy](https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-requiredModelAccuracy.md). – Len Greski Apr 15 '18 at 12:23
- @user2165379 a really simple way to improve accuracy is to change "rpart" to "rf" and tune `mtry` and `ntree` (a sketch of this appears after the comment thread below). – missuse Apr 15 '18 at 12:53
- @missuse Thanks. I have changed the createDataPartition for inTrain from p=0.05 to p=0.70. I have left your tuning settings the same. The result is an extremely detailed tree with more than 200 nodes (splits). The accuracy is very good at 0.9641. Can this result be correct? (I admit I have to study the subject of tuning more.) PS: I have an rf model too, although I would like to understand rpart. Thanks a lot! – user2165379 Apr 15 '18 at 17:39
- The more data you use for training, the better the model will generalize, which leads to higher accuracy. – missuse Apr 15 '18 at 17:48
- @missuse I understand, although do you think it is correct that the tree has so many nodes? – user2165379 Apr 15 '18 at 18:20
- It is not usual. And I trust you could get better performance if you increased the cp to reduce over-fitting. Again, you should tune the optimal cp in caret. With the 0.05 split I received 94% accuracy using random forests with mtry = 18, and now I am attempting `xgboost` with Bayesian optimization; I will post the result when it's done. – missuse Apr 15 '18 at 18:29
- Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169037/discussion-between-user2165379-and-missuse). – user2165379 Apr 15 '18 at 18:46
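
Following the suggestion in the comments, here is a minimal sketch of how a random forest could be fit on the same data with caret, tuning `mtry` and passing `ntree` through to `randomForest`. The grid values and `ntree = 500` are illustrative assumptions, not values from the thread:

```r
# assumes TrainSet/TestSet from the chunks above and the caret package loaded
library(randomForest)

set.seed(12345)
mod_rf <- train(classe ~ .,
                data = TrainSet,
                method = "rf",
                # mtry is caret's tuning parameter for "rf"; these candidate
                # values are just an example grid
                tuneGrid = data.frame(mtry = c(2, 9, 18, 27)),
                # ntree is not tuned by caret; it is passed on to randomForest()
                ntree = 500,
                trControl = trainControl(method = "cv", number = 4))

mod_rf$bestTune                                     # selected mtry
confusionMatrix(predict(mod_rf, TestSet), TestSet$classe)
```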