
I am trying to create a decision tree using the rpart package in R. To arrive at the optimal depth for the tree, I am using the plotcp function. When I use printcp to analyze the cross-validation results, I get the following message, among other details:

Root node error: 3599.8/14399 = 0.25

My classes are unbalanced (Class 1: 75%, Class 2: 25%). rpart seems to be using a default threshold of 0.5, and since none of the nodes has a probability > 0.5 for class C2, every observation is classified as C1.

Is it possible to specify the probability threshold myself? For example, if prob > 0.35 for C2, classify the observation as C2.

Jeff
Dataminer

1 Answer


The message that you are getting:

Root node error: 3599.8/14399 = 0.25

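is not an error. It is part of the standard output of printcp and shows the error at the root node, before any split is made, expressed per observation. Presumably you have 14,399 observations. If you are doing classification, this is the misclassification rate you get by assigning every observation to the majority class, which is why it matches the 25% share of your minority class; the Gini index is the default criterion for choosing splits, not the quantity reported here. Your tree may well be doing fine, but we cannot tell because you have not posted the rest of your CP table.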
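As a rough check of what the "Root node error" line reports, the sketch below fits a classification tree and compares the root node's error with the share of the minority class. It uses the kyphosis data set that ships with rpart as a stand-in, since the original data were not posted; the variable names are only illustrative.

```r
library(rpart)

# kyphosis ships with rpart; the response Kyphosis is a factor
# with levels "absent" and "present" ("present" is the minority class)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# The first row of fit$frame describes the root node:
# dev is its error count and n is the number of observations
root_error <- fit$frame$dev[1] / fit$frame$n[1]

# Share of the minority class in the data
minority_share <- mean(kyphosis$Kyphosis == "present")

print(c(root_error = root_error, minority_share = minority_share))

# printcp shows the same ratio on its "Root node error" line
printcp(fit)
```

With default priors and no loss matrix, the two numbers should agree, because the root node simply predicts the majority class for everyone.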

It is also true that if you are doing classification (in rpart, the response in your formula is a factor, or you have set method = 'class'), each node is labelled with its majority class. So if every leaf node has its majority in the same class, everything sent down your tree will be assigned to that class. You can look into using weights to encourage different behaviour.
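One possible way to apply a custom cutoff, sketched here under assumptions, is to ask predict for class probabilities and threshold them yourself. The example again uses the kyphosis data set bundled with rpart in place of the original data, and the 0.35 cutoff is the hypothetical value from the question. The last call shows an alternative that changes the fit itself through the prior entry of parms; the equal priors are chosen only for illustration.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart ("present" is the minority class)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# type = "prob" returns a matrix with one column per class level
prob <- predict(fit, newdata = kyphosis, type = "prob")

# Hypothetical cutoff: label a row "present" when P(present) > 0.35
threshold <- 0.35
pred <- factor(ifelse(prob[, "present"] > threshold, "present", "absent"),
               levels = levels(kyphosis$Kyphosis))

# Confusion table under the custom cutoff
print(table(observed = kyphosis$Kyphosis, predicted = pred))

# Alternative: shift the tree itself by supplying class priors
# (given in the order of the factor levels; they must sum to 1)
fit_prior <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class", parms = list(prior = c(0.5, 0.5)))
```

Thresholding the predicted probabilities leaves the tree unchanged and only alters how leaves are labelled, whereas priors (or a loss matrix, or case weights) influence which splits are chosen in the first place.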

Alan Chalk