
I am using a data set of about 54K records with 5 classes (pop), one of which is insignificant. I am using the caret package and the following call to run rpart:

model <- train(pop ~ pe + chl_small, method = "rpart", data = training)

and I get the following tree:

n= 54259 

node), split, n, loss, yval, (yprob)
  * denotes terminal node

1) root 54259 38614 pico (0.0014 0.18 0.29 0.25 0.28)  
  2) pe< 5004 39537 23961 pico (0 0.22 0.39 2.5e-05 0.38)  
    4) chl_small< 32070.5 16948  2900 pico (0 0.00012 0.83 5.9e-05 0.17) *
    5) chl_small>=32070.5 22589 10281 ultra (0 0.39 0.068 0 0.54) *
  3) pe>=5004 14722  1113 synecho (0.0052 0.052 0.0047 0.92 0.013) *

It is obvious that node 5 should be split further, but rpart is not doing it. I tried cp values from 0.001 to 0.1, and also minbucket = 1000, as additional parameters, but saw no improvement.
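For reference, this is roughly how I passed those parameters (a sketch; from what I understand, train() tunes cp on its own, so a cp supplied through control may not take effect):

library(caret)
library(rpart)

# sketch: passing rpart controls through caret's train()
# (train() tunes cp itself, so this cp value may be overridden)
model <- train(pop ~ pe + chl_small, method = "rpart", data = training,
               control = rpart.control(cp = 0.001, minbucket = 1000))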

Appreciate any help on this.

    Why do you say that node 5 must be split? If the distribution of the classes within the node is unrelated to the predictors, then there'll be no gain from splitting it. – Hong Ooi Aug 19 '14 at 01:20
    Also, if you want to force rpart to split whenever possible, set `cp = -1` (or any negative number). – Hong Ooi Aug 19 '14 at 01:21

1 Answer


Try running the model with an even smaller cp = 0.00001, or with cp = -1. If it still does not split that node, it means the split would not improve the overall fit.
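With caret, one way to do this is to fix cp through tuneGrid instead of letting train() tune it (a sketch; the name model_small_cp is just a placeholder):

# sketch: pin cp at a single value; a negative cp forces rpart
# to attempt every possible split
model_small_cp <- train(pop ~ pe + chl_small, method = "rpart",
                        data = training,
                        tuneGrid = data.frame(cp = 1e-5))  # or cp = -1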

You can also try changing the splitting criterion from the default Gini impurity to information gain by passing parms = list(split = "information").
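With caret you can pass this straight through train(), which forwards extra arguments on to rpart (again a sketch, with model_info as a placeholder name):

# sketch: use information gain instead of Gini as the split criterion
model_info <- train(pop ~ pe + chl_small, method = "rpart",
                    data = training,
                    parms = list(split = "information"))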

If you do force it to split, it is a good idea to do a quick check: compare accuracy on the training set versus the test set, for both the original model and the model with the small cp.

If the train-versus-test gap is much larger for the small-cp model, that model is probably overfitting the data.
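A minimal sketch of that check, assuming a held-out data frame testing and fitted models model (original) and model_small_cp (small cp); all three names are placeholders:

# sketch: compare train/test accuracy gaps between the two models
acc <- function(fit, dat) {
  mean(as.character(predict(fit, newdata = dat)) == as.character(dat$pop))
}

acc(model, training) - acc(model, testing)                    # gap, original
acc(model_small_cp, training) - acc(model_small_cp, testing)  # gap, small cp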
