While taking Coursera's "Reproducible Research" class, I had trouble understanding the code the instructor used for a logistic regression.
The code uses the spam dataset from the kernlab package. The dataset classifies 4601 e-mails as spam or non-spam; in addition to this class label, there are 57 variables indicating the frequency of certain words and characters in each e-mail. The data has been split into a training set and a test set.
This code in particular uses the training dataset ("trainSpam"). It is supposed to go through each of the variables in the dataset and fit a generalized linear model, in this case a logistic regression, to predict whether an e-mail is spam using just a single variable.
I really don't understand what some of the lines in the code are doing. Could someone please explain them to me? Thank you.
trainSpam$numType = as.numeric(trainSpam$type) - 1 ## here a new column is created that codes nonspam e-mails as 0 and spam e-mails as 1
costFunction = function(x,y) sum(x != (y > 0.5)) ## I understand a function is being created, but I really don't understand what "costFunction" is supposed to do. I could really use an explanation for this
cvError = rep(NA,55)
library(boot)
for (i in 1:55){
    lmFormula = reformulate(names(trainSpam)[i], response = "numType") ## I really don't understand this line of code either
    glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
    cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
names(trainSpam)[which.min(cvError)]
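To make the two lines in question concrete, here is a minimal sketch on made-up toy values (not the spam data) of what they appear to do, going by the documentation for cv.glm and reformulate. The vectors "observed" and "predicted" are hypothetical, and "make" is used only as an example column name. As far as I can tell, cv.glm passes the observed responses as the first argument of the cost function and the predicted probabilities as the second.

```r
# Same cost function as in the course code.
costFunction <- function(x, y) sum(x != (y > 0.5))

# Hypothetical toy values: x = observed 0/1 labels,
# y = predicted probabilities from some fitted model.
observed  <- c(0, 1, 1, 0, 1)
predicted <- c(0.2, 0.9, 0.4, 0.7, 0.6)

# (predicted > 0.5) turns each probability into TRUE/FALSE.
# Comparing that with the 0/1 labels (0 equals FALSE, 1 equals TRUE)
# flags every misclassified case, and sum() counts them.
errors <- costFunction(observed, predicted)
print(errors)

# reformulate() builds a formula object from character strings,
# here the equivalent of typing numType ~ make by hand.
f <- reformulate("make", response = "numType")
print(f)
```

If this reading is right, the loop builds one such formula per column, fits a one-variable logistic regression, and records the cross-validated misclassification count, so the last line returns the name of the single variable with the fewest errors.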