AIM: The aim is to find a suitable step-function fit that uses age to describe wage in the Wage dataset from the ISLR library.
PLAN:
To find a suitable fit, I'll try several fits with different numbers of cut points. I'll use the glm() function for the fitting. To check which fit is best, I'll use the cv.glm() function (from the boot library) to cross-validate each fitted model.
PROBLEM:
In order to do so, I did the following:
library(ISLR)  # provides the Wage dataset
library(boot)  # provides cv.glm()
all.cvs = rep(NA, 10)
for (i in 2:10) {
  lm.fit = glm(wage~cut(Wage$age,i), data=Wage)
  all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
But this gives an error:
Error in model.frame.default(formula = wage ~ cut(Wage$age, i), data =
list( : variable lengths differ (found for 'cut(Wage$age, i)')
However, the code below runs without error. (It can be found here.)
all.cvs = rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut = cut(Wage$age, i)
  lm.fit = glm(wage~age.cut, data=Wage)
  all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
Hypotheses and Results:
My first thought was that cut() and glm() might not work together. But this works: glm(wage~cut(age,4),data=Wage)
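A second hypothesis is that the problem is the Wage$age reference rather than cut() itself. The sketch below is only a guess at the mechanism, not a confirmed diagnosis: it assumes that cv.glm() refits the model on subsets of the data, and it uses the built-in mtcars data instead of Wage so that it runs without extra packages. The names full, part, err_msg and ok_fit are made up for illustration.

```r
# Sketch on the built-in mtcars data (an assumption; the post itself uses ISLR's Wage).
full <- mtcars
part <- mtcars[1:20, ]   # stands in for one training fold (assumed behaviour of cv.glm)

# The formula names the full data frame explicitly, but data= is the subset,
# so the response has 20 values while the predictor has 32.
err_msg <- tryCatch({
  glm(mpg ~ cut(full$hp, 3), data = part)
  NA_character_
}, error = function(e) conditionMessage(e))
print(err_msg)

# With the bare column name, hp is looked up in the subset passed via data=.
ok_fit <- glm(mpg ~ cut(hp, 3), data = part)
print(length(fitted(ok_fit)))
```

If this is what happens inside cv.glm(), it would be consistent with the second version working, since age.cut is a column of Wage and so gets subset together with wage.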
Question:
So, basically, the working version calls cut(), saves its result in a variable, and then uses that variable in the glm() call. But putting the cut() call directly inside the glm() formula fails, and that seems to happen only when the code is in a loop.
So, why is the first version of the code not working?
This is confusing. Any help is appreciated.