rpart package in R - decision tree giving incorrect labels of numbers of rows ('n = ###') in each leaf node

Question

I'm using the rpart package in R to create a decision tree model out of a data frame called 'myData'. It has 85,590 rows.

The decision tree is created using code like this (the key part is that 'data = myData'):

decTree <- rpart(outcome ~ var1 + var2 + ..., data = myData, method = "anova", control = rpart.control(minsplit=30))

If I plot & label the 'leaf' (terminal) nodes of this decision tree, I get an initial split of 66,667 on the 'left' of the first node & 18,923 on the right (which add to 85,590, the total number of rows, as expected.)

plot(decTree) # Plot the tree

text(decTree, use.n = TRUE) #Label the tree

The rule that creates this initial split is var1 < 1.5.

BUT, if I count the number of rows in myData where var1 is < 1.5, I get 79,518, rather than the expected 66,667 shown in the tree (and if I count rows where var1 >= 1.5, I get the 'complement' of 6,072, rather than the expected 18,923 shown in the tree.)

length(which(myData$var1 < 1.5))

[1] 79518
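For reference, a minimal sketch of counting both sides of the split at once, with NAs counted separately so the three numbers must add up to the total. The vector `var1` here is a made-up stand-in for `myData$var1`, not the real data.

```r
# Hypothetical stand-in for myData$var1 (values invented for illustration).
var1 <- c(1, 1, 2, 1, 3, NA, 1, 2)

# sum() on a logical vector counts the TRUEs; na.rm = TRUE skips NA comparisons.
n_left  <- sum(var1 < 1.5,  na.rm = TRUE)  # rows that would go left at var1 < 1.5
n_right <- sum(var1 >= 1.5, na.rm = TRUE)  # rows that would go right
n_na    <- sum(is.na(var1))                # rows that satisfy neither comparison

# Every row is on the left, on the right, or NA.
n_left + n_right + n_na == length(var1)
```

If the three counts do not sum to the number of rows, the comparison is being made on something other than a plain numeric column (for example a factor), which is worth ruling out with `str(myData$var1)`.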

I realize that it won't be possible for you to reproduce this behavior on your own (and previous rpart models worked correctly for me in terms of node counts, so not sure why I'm having trouble this time), but I'm hoping someone has had this problem before, or else spots some dumb error in my code...

I tried just rerunning it all again, and still got all the same (mismatching) leaf count numbers.

Also, I checked decTree$frame and it is definitely NOT just the 'n = ...' labels that are wrong; the $frame values match what's displayed in the plot (and don't match the counts that I did myself.)

decTree$frame
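A sketch of how the counts in `$frame` can be cross-checked against the fitted object itself, assuming a tree fitted on the built-in `mtcars` data as a stand-in (the formula and `minsplit = 10` are illustrative choices, not the original model). It relies on the `frame` and `where` components of an rpart object: row names of `frame` are node numbers, and `where` gives, for each observation, the row of `frame` for the leaf it landed in.

```r
library(rpart)

# Stand-in model on built-in data; not the model from the question.
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

# Nodes 2 and 3 are the two children of the root (node 1).
root_children <- fit$frame[c("2", "3"), c("var", "n")]

# Rows of frame that are terminal nodes.
leaf_rows <- which(fit$frame$var == "<leaf>")

# Count observations per leaf from the per-row assignments in fit$where...
counts_from_where <- table(factor(fit$where, levels = leaf_rows))
# ...and compare with the n column that plot labels are taken from.
counts_from_frame <- fit$frame$n[leaf_rows]

all(as.integer(counts_from_where) == counts_from_frame)
```

Note that `text(..., use.n = TRUE)` labels the terminal nodes, so the numbers on the plot correspond to `counts_from_frame`, while the counts for the first split are the `n` values of nodes 2 and 3.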

Lastly, none of the 'var1' values are NA. I.e.:

length(which(is.na(myData$var1)))

[1] 0

The obvious question, this being a computer and all, is whether the split really is at 1.5 or some slightly different integer, which may affect the way observations fall on one side of the split or not. Take a close look at the splits, say by `print(decTree)`. — Gavin Simpson, Mar 21 '13 at 01:47
@Gavin - I think you're asking if the difference is due to a rounding error. The answer is 'no', but good thought. (But let me know if I misunderstood your comment.) — Ward W, Mar 22 '13 at 06:01
Well I meant that the split is not quite 1.5 but it is *displayed* that way - not just the floating point issue. Next step then, if you are *sure* the split is *exactly* 1.5 is to check what `nrow(na.omit(myData))` returns. It is sometimes not sufficient to just check a single variable for `NA`s. See `?rpart` which has `na.action = na.rpart` where `na.rpart` deletes any row where the response variable is `NA`. So check how many observations in your response are `NA`. — Gavin Simpson, Mar 22 '13 at 06:17
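The two checks suggested in the comments can be sketched as follows. This assumes a stand-in data frame `demo` (a copy of `mtcars` with two response values deliberately set to `NA`); the column names and model are illustrative, not the original `myData`.

```r
library(rpart)

# Stand-in data: mtcars with two response values blanked out (hypothetical).
demo <- mtcars
demo$mpg[c(3, 10)] <- NA

fit <- rpart(mpg ~ cyl + disp + hp + wt, data = demo, method = "anova",
             control = rpart.control(minsplit = 10))

# 1. Split points at full precision, rather than as rounded for display.
print(fit$splits[, "index"], digits = 15)

# 2. Rows the tree was built on versus rows in the data.
n_root        <- fit$frame$n[1]          # observations in the root node
n_response_na <- sum(is.na(demo$mpg))    # rows with a missing response
n_complete    <- nrow(na.omit(demo))     # rows with no NA in any column
na_per_column <- colSums(is.na(demo))    # where the NAs are, column by column
```

With the default `na.action = na.rpart`, rows with a missing response are dropped before fitting, so here `n_root` is `nrow(demo) - n_response_na`. If the root count equals `nrow(myData)`, as it does in the question, dropped rows are not the explanation.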
