I'm trying to use R's gbm regression model. I want to compute the coefficient of determination (R squared) between the cross-validation predicted response values and the true response values. However, the cv.fitted component of the gbm.object only provides predicted response values for a 1 - train.fraction share of the observations. So, to get what I want, I need to find out which observations the cv.fitted values correspond to.

Any idea how to get that information?

dan

1 Answer

You can use the predict function to easily get at model predictions, if I'm understanding your question correctly.

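set.seed(42)  # make the simulated data reproducible
dat <- data.frame(y = runif(1000), x = rnorm(1000))

# cv.folds = 0 skips cross-validation; the distribution is set explicitly so gbm does not have to guess it
gbmMod <- gbm::gbm(y ~ x, data = dat, distribution = "gaussian", n.trees = 5000, cv.folds = 0)

# adjusted R squared from regressing the in-sample predictions on the observed response
summary(lm(predict(gbmMod, n.trees = 5000) ~ dat$y))$adj.r.squared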

But that value is computed on the same data the model was fit to. Shouldn't we hold some data aside and assess model accuracy on it instead? This would correspond to the following, where I partition the data into a training set (70%) and a testing set (30%):

inds <- sample(nrow(dat), 0.7 * nrow(dat))  # row indices for the 70% training set

train <- dat[inds, ]
test <- dat[-inds, ]

gbmMod2 <- gbm::gbm(y ~ x, data = train, distribution = "gaussian", n.trees = 5000)

preds <- predict(gbmMod2, newdata = test, n.trees = 5000)

summary(lm(preds ~ test$y))$adj.r.squared
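Regressing the predictions on the observed values gives a number close to the squared correlation, which is not always the same as the coefficient of determination defined as 1 - SSE/SST. A minimal sketch of the direct calculation follows; the helper name `rsq` and the toy vectors are made up for illustration and are not part of gbm.

```r
# Hypothetical helper (not from gbm): coefficient of determination
# computed from observed and predicted values as 1 - SSE / SST.
rsq <- function(obs, pred) {
  sse <- sum((obs - pred)^2)        # residual sum of squares
  sst <- sum((obs - mean(obs))^2)   # total sum of squares
  1 - sse / sst
}

# Made-up numbers, only to show the call
obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.1, 1.9, 3.2, 3.8, 5.1)

r2_direct <- rsq(obs, pred)
r2_cor    <- cor(obs, pred)^2       # squared correlation, for comparison
```

With the hold-out example above, the same helper could be called as `rsq(test$y, preds)`. Unlike the squared correlation, this version can go negative when the predictions are worse than simply predicting the mean.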

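It's also worth noting that the number of trees in the gbm can be tuned using the gbm.perf function together with the cv.folds argument to the gbm function: with cv.folds set, gbm.perf estimates the number of trees that minimizes the cross-validated error. This helps avoid overfitting.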
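As a rough sketch of how this ties back to the question: when cv.folds is greater than 1, the fitted object carries a cv.fitted component which, as I read the gbm documentation, holds the out-of-fold predictions for the training rows in the order they appear in the data. Assuming that, and leaving train.fraction at its default of 1 so every row is a training row, a cross-validated R squared can be computed directly. The toy data, the two predictors `x1` and `x2`, and the tuning values below are assumptions chosen for illustration.

```r
library(gbm)

set.seed(1)
# Hypothetical toy data with a real signal, so that R squared is informative
dat <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
dat$y <- sin(dat$x1) + 0.5 * dat$x2 + rnorm(1000, sd = 0.3)

# 5-fold cross-validation; train.fraction is left at its default of 1
gbmCV <- gbm(y ~ x1 + x2, data = dat, distribution = "gaussian",
             n.trees = 500, shrinkage = 0.05, cv.folds = 5, n.cores = 1)

# Number of trees that minimizes the cross-validated error
bestIter <- gbm.perf(gbmCV, method = "cv", plot.it = FALSE)

# Assumption: cv.fitted[i] is the out-of-fold prediction for row i of dat
cvPreds <- gbmCV$cv.fitted

# Cross-validated R squared as 1 - SSE / SST
cvR2 <- 1 - sum((dat$y - cvPreds)^2) / sum((dat$y - mean(dat$y))^2)
```

If train.fraction is set below 1, it is worth checking `length(gbmCV$cv.fitted)` against the number of training rows before pairing predictions with observations.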

Tad Dallas