
I'm trying to do 10-fold cross-validation for some GLM models that I built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I write the following call:

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

The examples I have seen so far provide the "data" argument as the test set, but that did not really make sense to me: why do 10 folds on the same test set? They would all give exactly the same result (I assume!).

Unfortunately ?cv.glm explains it in a foggy way:

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response

My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?

Here's what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)
Error404
  • Look at the example section of `boot:::cv.glm`. You should input the whole data set, the model, and the number of CV folds. – Roman Luštrik Jan 27 '14 at 13:07
  • Thanks for your reply @RomanLuštrik. Sounds great. I am still wondering about a couple of things, though. Does this function use all the supplied data in the cross-validation? Suppose I supplied a data frame of 1000 rows to `cv.glm(data, glm, K=10)`: does it make 10 partitions of the data, each of 100 rows, and do the cross-validation? Sorry, I have been through ?cv.glm but I did not find that there. – Error404 Jan 27 '14 at 13:28
  • 1
    If you were doing a 2-fold CV, the function would take 50% of the data and fit the model. It would use the other 50% of the data to see how well the model describes the data. Or, in leave-one-out CV, it would fit the model to all but one data "point", and see how well the singled-out "point" did. Repeat N times and you get your result. – Roman Luštrik Jan 27 '14 at 20:38
  • 4
    Hi @RomanLuštrik. You said that if I did a 2-fold CV, the function will fit the model according to 50% of the data and use the other 50% as a test set. If the function does that, then why does it require an argument "glmfit" which is a previously fitted model? – Error404 Jan 28 '14 at 11:03
  • 2
    If you have a question on crossvalidation, I suggest you open a thread at crossvalidated.com. – Roman Luštrik Jan 29 '14 at 09:02

2 Answers


I am always a little cautious about using various packages' 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]

#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

#Perform 10-fold cross-validation
for(i in 1:10){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds == i, arr.ind = TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use the test and train data partitions however you desire...
}
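The loop body above can be filled in along these lines. This is a sketch only, using simulated data and the column names from the question (`groupcol`, `var1`–`var3`), and it also shows one way to answer the "error for each fold" question:

```r
# Sketch: fit a logistic model on each training fold and record the
# average squared error on the held-out fold.
# Data and column names are simulated/assumed -- adapt to your own data.
set.seed(42)
yourData <- data.frame(groupcol = rbinom(100, 1, 0.5),
                       var1 = rnorm(100), var2 = rnorm(100), var3 = rnorm(100))

yourData <- yourData[sample(nrow(yourData)), ]
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

foldErrors <- numeric(10)
for (i in 1:10) {
    testIndexes <- which(folds == i)
    testData  <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    fit  <- glm(groupcol ~ var1 + var2 + var3,
                family = "binomial", data = trainData)
    pred <- predict(fit, testData, type = "response")
    foldErrors[i] <- mean((testData$groupcol - pred)^2)
}
foldErrors        # one average squared error per fold
mean(foldErrors)  # comparable to cv.glm()$delta[1] when fold sizes are equal
```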
Jake Drew
    Thanks Jake Drew. For testing purpose, I have compared your code above with the results from cv.glm and results are identical. Thanks to your post, I can now trust cv.glm ;-) – citraL Sep 09 '15 at 15:29

@Roman provided some answers in his comments; however, the answer to your questions is found by inspecting the code of cv.glm:

I believe this bit of code splits the data set up randomly into the K folds, rounding K to a feasible number of folds when K does not divide n evenly:

if ((K > n) || (K <= 1)) 
    stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0)) 
    K <- kvals[temp == min(temp)][1L]
if (K != K.o) 
    warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
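To see what that last step produces, here is a sketch of the fold assignment for the 1000-row, K = 10 case from the comments (using base `sample`; boot's internal `sample0` has the same effect of shuffling the fold labels):

```r
# Sketch of cv.glm's fold assignment for n = 1000, K = 10
n <- 1000; K <- 10
f <- ceiling(n / K)              # rows per fold: 100
s <- sample(rep(1L:K, f))[1:n]   # a random fold label for every row
table(s)                         # ten folds of exactly 100 rows each
```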

This bit shows that the delta value is NOT the root mean squared error. It is, as the help file says, "the average squared error function" by default. What does this mean? We can see by inspecting the function declaration:

function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2), 
    K = n) 

which shows that within each fold, we calculate the average of the squared error, where the error is, in the usual sense, the difference between the predicted response and the actual response.
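For a binomial model, for example, the default cost compares the 0/1 responses with the fitted probabilities (toy values below, chosen only to make the arithmetic visible):

```r
# Sketch: cv.glm's default cost is the mean squared difference between
# observed responses and predictions
cost <- function(y, yhat) mean((y - yhat)^2)

y    <- c(0, 1, 1, 0)           # observed 0/1 responses (toy values)
yhat <- c(0.2, 0.8, 0.6, 0.4)   # predicted probabilities (toy values)
cost(y, yhat)                   # (0.04 + 0.04 + 0.16 + 0.16) / 4 = 0.1
```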

delta[1] is simply the fold-size-weighted average of these per-fold costs; see my inline comments in the code of cv.glm:

for (i in seq_len(ms)) {
    j.out <- seq_len(n)[(s == i)]
    j.in <- seq_len(n)[(s != i)]
    Call$data <- data[j.in, , drop = FALSE]
    d.glm <- eval.parent(Call)
    p.alpha <- n.s[i]/n # this fold's weight: fold size / n
    cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
        , drop = FALSE], type = "response"))
    CV <- CV + p.alpha * cost.i # add this fold's weighted error to the running total
    cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
        data, type = "response"))
}
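So delta[1] accumulates cost.i weighted by n.s[i]/n. A sketch with hypothetical per-fold numbers (not from a real fit) makes the weighting concrete:

```r
# Sketch: delta[1] as the fold-size-weighted average of per-fold costs
# (toy numbers, not from an actual cv.glm run)
n.s    <- c(34, 33, 33)           # fold sizes for n = 100, K = 3
cost.i <- c(0.12, 0.10, 0.14)     # hypothetical per-fold average squared errors
n      <- sum(n.s)
CV     <- sum((n.s / n) * cost.i) # what the loop accumulates into CV
CV                                # 0.12
```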
Alex