I'm using the rpart package in R to create a decision tree model out of a data frame called 'myData'. It has 85,590 rows.
The decision tree is created using code like this (the key part is that 'data = myData'):
decTree <- rpart(outcome ~ var1 + var2 + ..., data = myData, method = "anova", control = rpart.control(minsplit=30))
If I plot & label the 'leaf' (terminal) nodes of this decision tree, I get an initial split of 66,667 on the 'left' of the first node & 18,923 on the right (which add to 85,590, the total number of rows, as expected.)
plot(decTree) # Plot the tree
text(decTree, use.n = TRUE) # Label the tree
The rule that creates this initial split is var1 < 1.5.
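For reference, the rule can also be read straight from the fitted object rather than from the plot labels. The snippet below is only a sketch on the built-in mtcars data (the model `fit` and its formula are made up for illustration, not my real model); it prints the text form of the tree and the `splits` matrix, whose `index` column holds the cut point for a continuous variable.

```r
library(rpart)

# Illustrative model on built-in data; stands in for decTree, which I can't share.
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

# Text form of the tree: one line per node with its rule and n.
print(fit)

# Split table: row names are variable names; the first row is the
# primary split at the root node.
rootSplit <- fit$splits[1, , drop = FALSE]
print(rootSplit)
```

On my real model the equivalent would be print(decTree) and decTree$splits.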
BUT, if I count the number of rows in myData where var1 < 1.5, I get 79,518 rather than the 66,667 shown in the tree (and if I count rows where var1 >= 1.5, I get the complement, 6,072, rather than the 18,923 shown in the tree).
length(which(myData$var1 < 1.5))
[1] 79518
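Both sides of the condition, plus any NAs, can be counted in one call with table(). A minimal sketch, using a small invented data frame (`toyData` is hypothetical and only stands in for myData):

```r
# Hypothetical stand-in for myData; the values are invented for illustration.
toyData <- data.frame(var1 = c(1, 1, 2, 3, 1, NA, 2))

# Tabulate the split condition, keeping NA as its own category so that
# missing values can't silently drop out of the counts.
splitCounts <- table(toyData$var1 < 1.5, useNA = "ifany")
print(splitCounts)
```

The TRUE and FALSE cells correspond to the two length(which(...)) counts above, and an NA cell would appear only if var1 had missing values.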
I realize that you won't be able to reproduce this behavior on your own. Previous rpart models gave me correct node counts, so I'm not sure why I'm having trouble this time. I'm hoping someone has had this problem before, or else spots some dumb error in my code...
I tried just rerunning it all again, and still got all the same (mismatching) leaf count numbers.
Also, I checked decTree$frame and it is definitely NOT just the 'n = ...' labels that are wrong; the $frame values match what's displayed in the plot (and don't match the counts that I did myself).
decTree$frame
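Another cross-check is the `where` component of the fitted object, which records, for each observation used in the fit, the row of `frame` for the leaf it ended up in. A rough sketch on the built-in mtcars data (again, `fit` and its formula are made up for illustration):

```r
library(rpart)

# Illustrative model on built-in data; stands in for decTree.
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

# Count observations per leaf, keyed by row index into fit$frame.
leafCounts <- table(fit$where)

# The n that frame records for those same leaves.
leafRows <- as.integer(names(leafCounts))
frameCounts <- fit$frame$n[leafRows]

# Side-by-side comparison of the two counts.
print(cbind(frame_row = leafRows,
            counted = as.vector(leafCounts),
            frame_n = frameCounts))
```

On my real model the same idea would use decTree$where, and summarizing myData$var1 within each leaf (assuming no rows were dropped during fitting, so the two line up) should show which var1 values actually land on each side of the split.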
Lastly, none of the 'var1' values are NA. I.e.:
length(which(is.na(myData$var1)))
[1] 0