
I am using rpart to build a classification model for my data, but I do not know how to choose the bucket size so as to avoid an overfitted or underfitted model. I read that the train function in the caret package provides a way to find the optimal value, so I implemented the following few lines in R:

tree <- rpart(y ~ x1 + x2 + x3 + x4 + x5 + x6, method = 'class', data = train, minbucket = 15)  # I have anonymized the formula of my model
numfolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.0001, 0.005, 0.0001))
train(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = train, method = "rpart", trControl = numfolds, tuneGrid = cpGrid)

The printout gives:

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0.0024. 
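
A side note on this printout: caret chooses between regression and classification from the type of the outcome, so "RMSE was used" suggests that y was treated as numeric here, whereas the rpart call uses method = 'class'. The following is a minimal sketch on simulated data, assuming y is stored as a numeric 0/1 column; the data frame `dat`, the column `y_f`, and the cp grid are made up for illustration and are not the real data.

```r
library(caret)   # train(), trainControl()
library(rpart)   # backend used by method = "rpart"

set.seed(1)
n <- 500
# Simulated stand-in for the real data (hypothetical)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- as.integer(dat$x1 + rnorm(n, sd = 0.5) > 0)  # numeric 0/1 outcome

ctrl   <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))

# Numeric outcome: caret fits a regression tree and tunes cp on RMSE
# (it also warns that the outcome has only two distinct values)
fit_num <- train(y ~ x1 + x2, data = dat, method = "rpart",
                 trControl = ctrl, tuneGrid = cpGrid)

# Factor outcome: caret fits a classification tree and tunes cp on Accuracy
dat$y_f <- factor(dat$y, labels = c("no", "yes"))
fit_fac <- train(y_f ~ x1 + x2, data = dat, method = "rpart",
                 trControl = ctrl, tuneGrid = cpGrid)

fit_num$metric   # "RMSE"
fit_fac$metric   # "Accuracy"
```

A cp value tuned on RMSE for a regression tree is not necessarily meaningful for a classification tree, because the two are fitted and pruned on different error measures.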

OK, so I followed this and used cp = 0.0024 in my rpart model:

treeCV <- rpart(y ~ x1 + x2 + x3 + x4 + x5 + x6, method = 'class', data = train, cp = 0.0024)
prp(treeCV)

I got only a root node in the prp visualization.
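
To check whether the fitted tree really has no splits, rather than this being a plotting issue, the complexity table and the class balance can be inspected directly. This is a rough sketch on simulated data; `dat` and its columns are hypothetical, chosen so that the predictor carries no signal and the classes are imbalanced.

```r
library(rpart)

set.seed(1)
n <- 300
# Hypothetical data: x1 is pure noise, and about 90% of cases are "no"
dat <- data.frame(
  x1 = rnorm(n),
  y  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
)

fit <- rpart(y ~ x1, data = dat, method = "class", cp = 0.0024)

table(dat$y)      # class balance of the outcome
printcp(fit)      # complexity table: one row per candidate tree size
nrow(fit$frame)   # number of nodes; a single row means the tree is only a root
```

If the complexity table has just one row, no split reduced the misclassification error by at least cp, which can happen with a strongly imbalanced outcome or weak predictors.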

Any help? Please let me know if more information is needed.

ci_
Guanhua Lee
  • It most likely has to do with your data; you should look at variable importance and/or bivariate plots. Are your groups imbalanced? What happens if you don't do CV: is there then a model that isn't just a root? There are a lot of reasons why the best model found is only the root, and it probably has to do with your data rather than anything else. – chappers May 21 '15 at 05:34
  • I initially had 15 predictors, some categorical and some continuous. I chose these final 6 because they give a good linear regression fit. Also, my data contain repeated entries, i.e. think of them as multiple visits from the same customer, where the outcome of each visit is binary; that outcome is my dependent variable. I suspect this has something to do with it; would I be right? – Guanhua Lee May 21 '15 at 06:17
