8

I used caret to train an rpart model below.

library(caret)

trainIndex <- createDataPartition(d$Happiness, p=.8, list=FALSE)
dtrain <- d[trainIndex, ]
dtest <- d[-trainIndex, ]
fitControl <- trainControl(## 10-fold CV, repeated 10 times
  method = "repeatedcv", number=10, repeats=10)
fitRpart <- train(Happiness ~ ., data=dtrain, method="rpart",
                trControl = fitControl)
testRpart <- predict(fitRpart, newdata=dtest)

dtest contains 1296 observations, so I expected testRpart to produce a vector of length 1296. Instead it's 1077 long, i.e. 219 short.

When I ran the prediction on the first 220 rows of dtest, I got a single prediction back (a vector of length 1), so the output is consistently 219 short.

Why does this happen, and what can I do to get one prediction per input row?

Edit: d can be loaded from here to reproduce the above.

Ricky
  • can you make your example reproducible? – Josh W. Jun 07 '15 at 03:41
  • Have edited to provide link to load `d` above (2.3 MB). Not sure what's the protocol on SO when data to reproduce is reasonably big: I'm putting it up in my Dropbox, which may not be permanent. Is there a better way? – Ricky Jun 07 '15 at 03:48
  • The best way is to use a small dataset so that it can be posted. The behavior you see should be easy to produce with a small subset of your data, or some simulated data. – CoderGuy123 Mar 30 '18 at 19:47

3 Answers

14

I downloaded your data and found what explains the discrepancy.

If you simply remove the rows with missing values from your test set, the length of the output matches the number of input rows:

testRpart <- predict(fitRpart, newdata = na.omit(dtest))

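Note that nrow(na.omit(dtest)) is 1103, and length(testRpart) is also 1103. Rows with missing values are dropped because predict.train in caret defaults to na.action = na.omit, so you need a strategy for handling missing values. See ?predict.train and ?predict.rpart for the options of the na.action parameter and choose the one you want.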
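
If the predictions have to line up with the rows of dtest, one option is to predict only on the complete rows and write the results back into a full-length vector. Below is a minimal sketch on made-up data. It uses lm in place of the caret model so that it runs without extra packages, and toy_train, toy_test, ok and pred_full are hypothetical names, not objects from the question:

```r
# Toy data standing in for dtrain/dtest (hypothetical values).
toy_train <- data.frame(x = c(1, 2, 3, 4, 5),
                        y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_test  <- data.frame(x = c(1.5, NA, 3.5, NA, 4.5))

# lm stands in for the caret/rpart model here.
fit <- lm(y ~ x, data = toy_train)

# TRUE for test rows that have no missing values.
ok <- complete.cases(toy_test)

# Start with NA everywhere, then fill in predictions for the complete rows.
pred_full <- rep(NA_real_, nrow(toy_test))
pred_full[ok] <- predict(fit, newdata = toy_test[ok, , drop = FALSE])

length(pred_full) == nrow(toy_test)  # one slot per test row
```

With a factor outcome, pred_full would need to be a factor or character vector rather than a numeric one.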

Josh W.
  • This is not that helpful as there are many situations in which one needs the missing values in place. It seems that `predict.train` doesn't have a way to deal with this issue. – CoderGuy123 Mar 30 '18 at 19:48
  • Similar to what Josh mentioned, if you need to generate predictions using `predict.train` from caret, simply pass the `na.action` of `na.pass`: `testRpart <- predict(fitRpart, newdata = dtest, na.action = na.pass)` – davedgd May 21 '18 at 13:37
  • @davedgd this should be a separate answer! Exactly what I was looking for, adding na.action = na.pass seems like the best solution to me and totally fixed my issues. – Ricky Jul 12 '20 at 07:19
  • @Ricky: Thanks for the suggestion -- I've gone ahead and added it as a separate answer for visibility! – davedgd Jul 28 '20 at 17:35
2

Similar to what Josh mentioned, if you need to generate predictions using predict.train from caret, simply pass the na.action of na.pass:

testRpart <- predict(fitRpart, newdata = dtest, na.action = na.pass)
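
Passing the NAs through is reasonable for rpart because a fitted tree can still route a row with missing predictors by using surrogate splits, so every row gets a prediction. Here is a small sketch that calls rpart directly on the built-in iris data (not the caret wrapper, and not the data from the question) to illustrate that behaviour:

```r
library(rpart)

# Fit a classification tree on complete data.
fit <- rpart(Species ~ ., data = iris)

# Three test rows, one of them with a missing predictor.
test <- iris[c(1, 51, 101), ]
test$Petal.Length[2] <- NA

# predict.rpart uses na.action = na.pass by default, so no row is dropped.
pred <- predict(fit, newdata = test, type = "class")

length(pred) == nrow(test)  # one prediction per row, including the row with NA
```

Other model types may not tolerate missing predictors, in which case na.pass could lead to an error or to NA predictions instead.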

Note: moving this to a separate answer based on Ricky's comment on Josh's answer above for visibility.

davedgd
0

I had a similar issue when I used "newx" instead of "newdata" in the predict function. Using "newdata" (or leaving the argument unnamed) solved my problem. I hope this helps someone else who used newx and has the same issue.
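
A likely reason, sketched here with lm on made-up data rather than with caret: many predict methods accept extra arguments through `...`, so an unrecognised argument such as newx is silently ignored and the predictions come back for the training data instead. The names train_df, test_df, wrong and right are only for illustration:

```r
# Made-up training and test data.
train_df <- data.frame(x = 1:10, y = 2 * (1:10) + 1)
test_df  <- data.frame(x = c(11, 12, 13))

fit <- lm(y ~ x, data = train_df)

# newx is not an argument of predict.lm, so it is swallowed by `...`
# and the fitted values for the 10 training rows are returned.
wrong <- predict(fit, newx = test_df)

# newdata is the argument predict.lm actually reads.
right <- predict(fit, newdata = test_df)

c(length(wrong), length(right))
```

A length mismatch between the predictions and the test set is a quick hint that the new data was never used.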