
In *An Introduction to Statistical Learning* we're asked to perform Leave-One-Out Cross-Validation (LOOCV) for logistic regression manually. The code for it is here:

count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    ## fit a logistic regression model, leaving the ith observation out of the training data
    glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)

    ## predict the held-out observation and compare against the truth
    is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up = Weekly[i, ]$Direction == "Up"

    if (is_up != is_true_up)
        count[i] = 1
}
sum(count)
## [1] 490

The source of this code can be found here.

This means the LOOCV error rate is approximately 45% (490 misclassifications out of 1089 observations). But when we do the same thing using the cv.glm() function from the boot package, the result is very different.

> library(boot)
> glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
> cv.glm = cv.glm(Weekly, glm.fit)
> cv.glm$delta
[1] 0.2464536 0.2464530

Why does this happen? What exactly does the cv.glm() function do?

Mooncrater
  • Probably you need to provide a cost function that computes the classification error, since the default setting computes the mean squared error. Check out the last paragraph of `?boot::cv.glm`. The example there looks very similar to your case. – Kota Mori Aug 08 '17 at 11:40
  • Okay! Thanks @KotaMori. – Mooncrater Aug 08 '17 at 16:25

1 Answer


I believe there may be a bug in the cv.glm function. On line 23 of its source it computes cost(glm.y, fitted(glmfit)), where fitted(glmfit) are the fitted probabilities. To compute the cross-validated error rate (the number of misclassified observations divided by n), these probabilities first need to be mapped to classes. In other words, if you replace

cost.0 <- cost(glm.y, fitted(glmfit))

with

cost.0 <- cost(glm.y, ifelse(fitted(glmfit)>0.5, 1, 0))

I believe you should get the same result as your manual loop.
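
Alternatively, following the pointer in the comments, you can get the same behavior without patching cv.glm at all: pass a classification cost function as the third argument, which is the pattern used in the last example of `?boot::cv.glm`. A minimal sketch (assuming Weekly comes from the ISLR package):

library(ISLR)  # assumed source of the Weekly data
library(boot)

## misclassification cost: fraction of observations whose fitted
## probability lands on the wrong side of 0.5 (cf. the example in ?cv.glm)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)

## LOOCV is the default (K = n), so this refits the model once per observation
cv.err <- cv.glm(Weekly, glm.fit, cost = cost)
cv.err$delta[1]
## should come out near 0.45 (490/1089), matching the manual loop

For reference, the default cost is function(y, yhat) mean((y - yhat)^2), i.e. the average squared difference between the 0/1 response and the fitted probability, which is where the 0.2465 figure in the question comes from. The two numbers in delta are the raw cross-validation estimate and a bias-adjusted version; under LOOCV they are nearly identical.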