I am trying to do model selection and want to retrieve the mean RMSE from 10-fold cross-validation. For some models I can use the train() function from the caret package; for other models I want to look at, I found a manual way to do k-fold cross-validation here: https://www.r-bloggers.com/2016/06/bootstrap-and-cross-validation-for-evaluating-modelling-strategies/
However, the cross-validated RMSEs differ from each other more than I would expect. Below is the code where I apply the different methods of retrieving the RMSE to the same model:
library(caret)
library(datasets)
exd <- warpbreaks
## k-fold cross-validation
Repeats <- 100
cv_repeat_num <- Repeats / 10
the_control <- trainControl(method = "repeatedcv", number = 10, repeats = cv_repeat_num)
cv_ex <- train(breaks ~ wool + tension, data = exd, method = "glm", family = "poisson", trControl = the_control)
m_ex <- glm(breaks ~ wool + tension, data = exd, family = "poisson")
results <- numeric(10 * cv_repeat_num)
for (j in 0:(cv_repeat_num - 1)) {
  cv_group <- sample(1:10, nrow(exd), replace = TRUE)
  for (i in 1:10) {
    train_data <- exd[cv_group != i, ]
    test_data <- exd[cv_group == i, ]
    # refit on the training folds only; keep m_ex (the full-data fit) untouched
    m_cv <- update(m_ex, data = train_data)
    results[j * 10 + i] <- RMSE(
      predict(m_cv, newdata = test_data),
      test_data$breaks
    )
  }
}
#RMSE from manual cross validation
mean(results)
#RMSE from RMSE function, no cross validation
RMSE(predict(m_ex, exd), exd$breaks)
#RMSE from train function
mean(cv_ex$resample$RMSE)
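One thing I am unsure about: predict() on a glm object returns values on the link (log) scale by default, and I assume (not verified) that train() predicts on the response scale. A minimal check of the two scales, using the same warpbreaks model as above:

```r
library(caret)
# fit the same Poisson model on the full warpbreaks data
m <- glm(breaks ~ wool + tension, data = warpbreaks, family = "poisson")
# default predict.glm output is on the link (log) scale
rmse_link <- RMSE(predict(m), warpbreaks$breaks)
# type = "response" gives predicted counts, i.e. the same scale as breaks
rmse_resp <- RMSE(predict(m, type = "response"), warpbreaks$breaks)
c(link = rmse_link, response = rmse_resp)
```

If the two values differ substantially, that might account for part of the discrepancy.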
This difference in RMSE does not occur when I use a simple linear model instead of the Poisson model in this example. Can someone shed some light on why that is, and on whether I can simply use the manual approach I found?