
In *An Introduction to Statistical Learning* we're asked to perform Leave-One-Out Cross-Validation (LOOCV) for logistic regression manually. The code for it is here:

count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    ## fit a logistic regression model, leaving the ith observation out of the training data
    glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)

    ## predict the held-out observation and compare against the truth
    is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up = Weekly[i, ]$Direction == "Up"

    if (is_up != is_true_up)
        count[i] = 1
}
sum(count)
## [1] 490

The source of this code can be found here.

This means the LOOCV error rate is approximately 45% (490 misclassifications out of 1089 observations). But when we do the same thing using the cv.glm() function from the boot package, the result is very different.

> library(boot)
> glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
> cv.glm = cv.glm(Weekly, glm.fit)
> cv.glm$delta
[1] 0.2464536 0.2464530

Why does this happen? What exactly does the cv.glm() function do?

Mooncrater
  • Probably you need to provide a cost function that computes the classification error, since the default setting computes the mean squared error. Check out the last paragraph of `?boot::cv.glm`. The example there looks very similar to your case. – Kota Mori Aug 08 '17 at 11:40
  • Okay! Thanks @KotaMori. – Mooncrater Aug 08 '17 at 16:25

1 Answer


I believe there may be a bug in the cv.glm function. On line 23 of its source it computes cost(glm.y, fitted(glmfit)), where fitted(glmfit) are the fitted probabilities. To compute the cross-validated error rate (the number of misclassified observations divided by n), these probabilities first need to be mapped to classes. In other words, if you replace

cost.0 <- cost(glm.y, fitted(glmfit))

with

cost.0 <- cost(glm.y, ifelse(fitted(glmfit)>0.5, 1, 0))

I believe you should get the same result as your manual loop.
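
Alternatively, following the pointer in the comments, you can get the same behavior without patching cv.glm at all: pass a classification cost function as the third argument, which is the pattern used in the last example of `?boot::cv.glm`. A minimal sketch (assuming Weekly comes from the ISLR package):

library(ISLR)  # assumed source of the Weekly data
library(boot)

## misclassification cost: fraction of observations whose fitted
## probability lands on the wrong side of 0.5 (cf. the example in ?cv.glm)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)

## LOOCV is the default (K = n), so this refits the model once per observation
cv.err <- cv.glm(Weekly, glm.fit, cost = cost)
cv.err$delta[1]
## should come out near 0.45 (490/1089), matching the manual loop

For reference, the default cost is function(y, yhat) mean((y - yhat)^2), i.e. the average squared difference between the 0/1 response and the fitted probability, which is where the 0.2465 figure in the question comes from. The two numbers in delta are the raw cross-validation estimate and a bias-adjusted version; under LOOCV they are nearly identical.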