k-fold cross-validation to find misclassification rate

Question

I am trying to write a function which takes a binary response variable y and a single explanatory variable x, runs 10-fold cross validation and returns the proportion of the response variables y that are incorrectly classified. I then want to run this function to find out which of the explanatory variables are best at predicting low birth weight.

So far I have tried

attach(birthwt)
folds <- cut(seq(1,nrow(birthwt)),breaks=10,labels=FALSE) # Create 10 equally sized folds
bwt<-glm(low~age+lwt+race+smoke+ptl+ht+ui+ftv,family=binomial)

But I really don't know where to go from here..

Thank you

Anders Ellern Bilgrau · Accepted Answer · 2014-10-23T11:33:57.217

For each group, you need to construct a training and test data set, fit the model on the training set, and then use the predict.glm function to do the actual predictions in the test data. The comments below should be enough explanation.

library("MASS")   # for the dataset
library("dplyr")  # for the filter function
data(birthwt)
set.seed(1)

n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))),  breaks=n.folds, labels=FALSE) # Note: sample() shuffles the rows first

all.confustion.tables <- list()
for (i in seq_len(n.folds)) {
  # Create training set
  train <- filter(birthwt, folds != i)  # Take all other samples than i

  # Create test set
  test <- filter(birthwt, folds == i)

  # Fit the glm model on the training set
  glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                   family = binomial, data =  train)

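  # Use the fitted model on the test set
  # (predict.glm returns the linear predictor, i.e. log-odds, by default)
  logit.prob <- predict(glm.train, newdata = test)

  # Classify based on the predictions (log-odds < 0 means P(low = 1) < 0.5)
  pred.class <- ifelse(logit.prob < 0, 0, 1)

  # Construct the confusion table; fixing the levels keeps it 2 x 2
  # even if a fold happens to contain only one predicted or true class
  all.confustion.tables[[i]] <- table(pred = factor(pred.class, levels = 0:1),
                                      true = factor(test$low, levels = 0:1))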
}

Note that I randomly assign the rows to the folds. This is important, because otherwise the folds simply follow the row order of the data.
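To see why the shuffling matters, one can tabulate the response within each fold with and without `sample()`. This is only a sketch: the names `ordered.folds`, `shuffled.folds`, `tab.ordered` and `tab.shuffled` are made up for illustration, and it assumes that the rows of `birthwt` in MASS are grouped by `low`, which the first table should let you check.

```r
library("MASS")  # for the birthwt dataset
data(birthwt)
n.folds <- 10

# Folds cut in row order (as in the question)
ordered.folds <- cut(seq_len(nrow(birthwt)), breaks = n.folds, labels = FALSE)

# Folds cut after shuffling the row indices (as in the answer)
set.seed(1)
shuffled.folds <- cut(sample(seq_len(nrow(birthwt))), breaks = n.folds,
                      labels = FALSE)

# Count the low = 0 and low = 1 cases that land in each fold
tab.ordered  <- table(fold = ordered.folds,  low = birthwt$low)
tab.shuffled <- table(fold = shuffled.folds, low = birthwt$low)
tab.ordered
tab.shuffled
```

If some folds in `tab.ordered` contain only one class, a model tested on them is evaluated on a very unrepresentative sample, which is what the random assignment avoids.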

We can then compare the predictions with the true values (given in the confusion tables) for, say, the 5th cross-validation run:

all.confustion.tables[[5]]
#    true
# pred 0 1
#    0 7 8
#    1 4 0

Hence, in this particular run, taking low = 1 as the positive class, we have 7 true negatives, 4 false positives, 8 false negatives, and 0 true positives.
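As a small worked example, the cells of that table can be pulled out by name. The sketch below rebuilds the printed table by hand (so it does not depend on the random folds) and treats `low = 1` as the positive class; the names `tab`, `tn`, `fp`, `fn`, `tp` and `fold.error` are illustrative only.

```r
# Rebuild the table printed above: rows = predicted class, columns = true class
tab <- matrix(c(7, 4, 8, 0), nrow = 2,
              dimnames = list(pred = c("0", "1"), true = c("0", "1")))

tn <- tab["0", "0"]  # predicted 0, truly 0
fp <- tab["1", "0"]  # predicted 1, truly 0
fn <- tab["0", "1"]  # predicted 0, truly 1
tp <- tab["1", "1"]  # predicted 1, truly 1

# Misclassification rate for this fold: the off-diagonal share of the table
fold.error <- (fp + fn) / sum(tab)  # 12 / 19
fold.error
```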

We can now compute whatever measure of performance we want from our confusion tables.

# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x)))/sum(x) }
risk <- sapply(all.confustion.tables, misclassrisk)
mean(risk)
#[1] 0.3119883

So, in this case, we have a misclassification risk of about 31%. I'll leave it to you to wrap this into a function.
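One possible way to wrap this into the function the question asks for is sketched below. It is not part of the original answer: `cv.misclass` is a hypothetical helper that takes a 0/1 response `y` and a single numeric predictor `x`, and it pools the errors over all folds instead of averaging the per-fold risks, so its result can differ slightly from `mean(risk)` above.

```r
library("MASS")  # for the birthwt dataset

# Hypothetical helper: k-fold CV misclassification rate for a logistic
# regression of a 0/1 response y on a single predictor x
cv.misclass <- function(y, x, n.folds = 10) {
  dat <- data.frame(y = y, x = x)
  # Shuffle the row indices before cutting them into folds
  folds <- cut(sample(seq_len(nrow(dat))), breaks = n.folds, labels = FALSE)
  n.wrong <- 0
  for (i in seq_len(n.folds)) {
    train <- dat[folds != i, , drop = FALSE]
    test  <- dat[folds == i, , drop = FALSE]
    fit <- glm(y ~ x, family = binomial, data = train)
    # predict() gives log-odds by default; classify as 1 when log-odds >= 0
    pred.class <- ifelse(predict(fit, newdata = test) < 0, 0, 1)
    n.wrong <- n.wrong + sum(pred.class != test$y)
  }
  n.wrong / nrow(dat)  # proportion of all observations misclassified
}

# Usage: one cross-validated rate per explanatory variable
data(birthwt)
set.seed(1)
predictors <- c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")
rates <- sapply(predictors, function(v) cv.misclass(birthwt$low, birthwt[[v]]))
sort(rates)  # smallest rate first
```

Each call draws its own random folds, so small differences between variables may just reflect the fold assignment. Whether the lowest rate really identifies the "best" predictor is a statistical question in its own right.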

Hi, thank you so much for your response. However, I am still a little bit confused. I am not familiar with 'confusion tables'. At which point do I insert each explanatory variable to compute which is the best at predicting the low birth weight? Your solution works perfectly when I run it in RStudio but I just want to understand it a bit more :) — Rory78, Oct 23 '14 at 11:29
@Rory78 No problem. Confusion tables are just a cross-tabulation of the predicted and the true classes. I've edited the post to further explain. Your other questions are statistical questions, not programming questions. You'll probably get better answers for that on one of the statistical sister sites. — Anders Ellern Bilgrau, Oct 23 '14 at 11:35
Ok thank you for your explanations - very helpful! I'll try and get my head around it all now! — Rory78, Oct 23 '14 at 11:38
I am still having a bit of trouble understanding the misclassification rate. The code returns the mean risk, but how do you identify the value of the risk for each explanatory variable? — Rory78, Oct 24 '14 at 19:13
