
I have a matrix of features (in columns) where the last column is a class label. Observations are in rows.

I use rpart in R to build a decision tree over a subset of my data and test it with predict using the rest of the data. The code to learn the tree is

fTree <- rpart(feature$a ~ feature$m, data = feature[fold != k, ],
  method = "class", parms = list(split = "gini"))

The code to test it is

predFeature <- predict(fTree, newdata = feature[fold == k, ],
  type = "class")

where k is an integer that I use to select a subset of the data, while fold is a matrix I use to create different subsets.

I get a warning that I know some of you have seen before:

'newdata' had 306 rows but variables found have 3063 rows.

I read a related post but did not understand the explanation, so further help is appreciated. Thanks in advance.

Achim Zeileis
capella

1 Answer

It is hard to say for sure because your example is not reproducible, but I am fairly certain that the problem is the following. You fitted your tree with

rpart(feature$a ~ feature$m, data = feature[fold != k, ], ...)

Thus, the variables are always feature$a and feature$m from the full feature data set (which apparently has 3063 observations) and not from the subset feature[fold != k, ]. This works without error but is not the tree you wanted to fit. Consequently, predict issues the warning: newdata has only 306 observations, but these are not used. Because feature$m is hard-coded in the formula, the predictor is again taken from the full data set.

Using

rpart(a ~ m, data = feature[fold != k, ], ...)

is easier to read, less to type, and should fix the problems you observe.
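The difference can be reproduced with a minimal sketch. Since the original data are not available, this uses the built-in iris data as a hypothetical stand-in: the columns m and a, the fold vector, and k = 1 are all assumptions made for illustration, not the asker's actual setup.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(1)

# Hypothetical stand-in for the asker's data: 'm' is a feature, 'a' the class label.
feature <- data.frame(m = iris$Sepal.Length, a = iris$Species)

# Assumed fold assignment: one fold id (1..5) per row.
fold <- sample(rep(1:5, length.out = nrow(feature)))
k <- 1

train <- feature[fold != k, ]
test  <- feature[fold == k, ]

# Hard-coded feature$...: the variables are looked up in the full 'feature'
# object, so 'data = train' has no effect on what is used for fitting.
bad <- rpart(feature$a ~ feature$m, data = train,
             method = "class", parms = list(split = "gini"))
# predict() warns that 'newdata' and the variables found differ in row count.
pred_bad <- predict(bad, newdata = test, type = "class")

# Plain variable names: the variables are looked up in 'data' / 'newdata'.
good <- rpart(a ~ m, data = train,
              method = "class", parms = list(split = "gini"))
pred_good <- predict(good, newdata = test, type = "class")

length(pred_bad)   # one prediction per row of the full 'feature' data
length(pred_good)  # one prediction per row of 'test'
```

Comparing length(pred_bad) with nrow(test) is a quick way to check whether newdata was actually used.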

Achim Zeileis
  • So, when I say a ~ m, I'm saying which variable is dependent and which is independent. In this case the data used to fit the tree is feature[fold != k, ]. Likewise, predict uses the data in newdata. So rpart and predict use different data sets. However, when I use feature$a ~ feature$m, apart from the type of the variable (dep. vs indep.), I am also stating which data shall be used in fitting the tree. This also implies that predict must use the same data set. This is what I understand from your words. I must say that this is a bit weird, which is why I was stuck at first. You're right and helpful. Nice! – capella Mar 10 '16 at 00:03
  • Yes. The point about formulas in R (statements with ~) is that you conveniently separate the variable description for a model from the actual data used. But by accessing a variable from a specific data set with the $ operator, you overrule that behavior. – Achim Zeileis Mar 10 '16 at 02:07
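The separation of formula and data described in the comment above can be seen directly with model.frame from base R, which is what most modelling functions use internally to look up variables. This is only an illustrative sketch; the data frames train and test and their contents are made up.

```r
# Two small, made-up data sets with the same column names.
train <- data.frame(a = c(1, 2, 3), m = c(10, 20, 30))
test  <- data.frame(a = c(4, 5),    m = c(40, 50))

# A formula with plain names describes the variables only;
# the 'data' argument decides where their values come from.
f <- a ~ m
n_train <- nrow(model.frame(f, data = train))
n_test  <- nrow(model.frame(f, data = test))

# With $, the values are pinned to 'train', whatever 'data' says.
g <- train$a ~ train$m
n_pinned <- nrow(model.frame(g, data = test))

c(n_train, n_test, n_pinned)  # n_pinned matches 'train', not 'test'
```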