I think I've found a bug in how the predict() function behaves for method="gbm" in the caret package in R. I'd like to know whether others agree, or whether someone can explain this behavior.
1. Generate data
library(caret)
set.seed(1)  # arbitrary seed so the example is reproducible
x1 <- rnorm(100)
x2 <- rnorm(100, 2)
y <- x1 + x2 + rnorm(100)
df <- data.frame(x1=x1, x2=x2, y=y)
2. Predict using method="lm"
The following code works as expected: with method="lm", the two predicted values match. In the first case (p1), y is included in newdata; in the second case (p2), it is not.
tempd <- df[1:99, c("y", "x1", "x2") ]
newdata <- df[100, c("y", "x1", "x2")]
lm.fit <- train(y~x1 + x2, data=tempd, method="lm")
p1 <- predict(lm.fit$finalModel, newdata=newdata)
newdata <- df[100, c("x1", "x2")]
p2 <- predict(lm.fit$finalModel, newdata=newdata)
p1 should equal p2, and does:
p1==p2
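A side note on the comparison itself: == tests exact equality, which can be brittle for floating-point predictions. A tolerance-based check with all.equal() is a slightly more robust way to state the same expectation. The sketch below uses made-up numbers (not the predictions above) purely to show the difference between the two checks.

```r
# Two values that are mathematically equal but differ in floating point
a <- 0.1 + 0.2
b <- 0.3

exact  <- a == b                    # exact comparison
approx <- isTRUE(all.equal(a, b))   # comparison within a small tolerance

print(c(exact = exact, approx = approx))
```

The same pattern, isTRUE(all.equal(p1, p2)), could be used in place of p1==p2 in the steps here.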
3. Predict using method="gbm"
This code does not work as expected: with method="gbm" and the identical setup, the two predicted values do not match.
tempd <- df[1:99, c("y","x1","x2")]
newdata <- df[100, c("y","x1","x2")]
gbm.fit <- train(y~x1+x2, data=tempd, method="gbm", verbose=FALSE)
p1 <- predict(gbm.fit$finalModel, newdata=newdata,
              n.trees=gbm.fit$finalModel$tuneValue$n.trees,
              interaction.depth=gbm.fit$finalModel$tuneValue$interaction.depth,
              shrinkage=gbm.fit$finalModel$tuneValue$shrinkage)
newdata <- df[100, c("x1","x2")]
p2 <- predict(gbm.fit$finalModel, newdata=newdata,
              n.trees=gbm.fit$finalModel$tuneValue$n.trees,
              interaction.depth=gbm.fit$finalModel$tuneValue$interaction.depth,
              shrinkage=gbm.fit$finalModel$tuneValue$shrinkage)
In this case, p1 does not equal p2:
p1==p2
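One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: the two newdata frames in this step put the predictors in different column positions. The sketch below uses only base R to compare where x1 and x2 sit in each frame. If the underlying gbm predict method matched columns by position rather than by name (an assumption I have not verified), this difference would matter.

```r
# Rebuild a data frame with the same column layout as in step 1
# (the values themselves are arbitrary here).
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100, 2))
df$y <- df$x1 + df$x2 + rnorm(100)

with_y    <- df[100, c("y", "x1", "x2")]  # layout of the first newdata
without_y <- df[100, c("x1", "x2")]       # layout of the second newdata

# Column positions of the predictors in each frame
pos_with_y    <- match(c("x1", "x2"), names(with_y))
pos_without_y <- match(c("x1", "x2"), names(without_y))

print(pos_with_y)     # positions of x1 and x2 when y comes first
print(pos_without_y)  # positions of x1 and x2 when y is dropped
```

When y is listed first, x1 and x2 shift one column to the right relative to the frame without y.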
4. Predict using method="gbm" with a different setup
Curiously, with one small change (not explicitly naming the variables in the subset operation), it does work:
tempd <- df[1:99, ]
newdata <- df[100, ]
gbm.fit <- train(y~x1+x2, data=tempd, method="gbm", verbose=FALSE)
p1 <- predict(gbm.fit$finalModel, newdata=newdata,
              n.trees=gbm.fit$finalModel$tuneValue$n.trees,
              interaction.depth=gbm.fit$finalModel$tuneValue$interaction.depth,
              shrinkage=gbm.fit$finalModel$tuneValue$shrinkage)
newdata <- df[100, c("x1","x2")]
p2 <- predict(gbm.fit$finalModel, newdata=newdata,
              n.trees=gbm.fit$finalModel$tuneValue$n.trees,
              interaction.depth=gbm.fit$finalModel$tuneValue$interaction.depth,
              shrinkage=gbm.fit$finalModel$tuneValue$shrinkage)
p1==p2
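For comparison, here is a sketch of the same check made through the train object rather than through gbm.fit$finalModel. My understanding, which I would welcome correction on, is that predict() on a train object fitted with a formula looks up the predictors by name, so the presence or position of y in newdata should not matter. The seed and the names q1 and q2 are my own additions for illustration.

```r
library(caret)

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100, 2))
df$y <- df$x1 + df$x2 + rnorm(100)

# Same setup as step 3: y listed first in the training data
tempd <- df[1:99, c("y", "x1", "x2")]
gbm.fit <- train(y ~ x1 + x2, data = tempd, method = "gbm", verbose = FALSE)

# Predict through the train object instead of gbm.fit$finalModel
q1 <- predict(gbm.fit, newdata = df[100, c("y", "x1", "x2")])
q2 <- predict(gbm.fit, newdata = df[100, c("x1", "x2")])

# Tolerance-based comparison of the two predictions
same <- isTRUE(all.equal(unname(q1), unname(q2)))
print(same)
```

If the two values agree here but not in step 3, that would suggest the discrepancy comes from calling predict() on finalModel directly rather than from the fitted model itself.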
Thanks in advance for your thoughts.
Jeff