cv.glm variable lengths differ

Question

I am trying to run cv.glm on a linear model; however, each time I do, I get the error:

Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv +  : 
variable lengths differ (found for 'air-force-falcons')

air-force-falcons is the first variable in the dataset lindata. When I run glm I get no errors. All the variables are in a single dataset and there are no missing values.

> linearmod5<- glm(lindata$Y ~ 0 + lindata$HomeAdv + ., data=lindata, na.action="na.exclude")
> set.seed(1)
> cv.err.lin=cv.glm(lindata,linearmod5,K=10)
Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv +  : 
variable lengths differ (found for 'air-force-falcons')

I do not know what is driving this error or the solution. Any ideas? Thank you!

Your error is here `. -lindata$HomeAdv` what are you trying to achieve with this? — BBrill, Feb 05 '15 at 18:01
Even without that, the error remains: `> linearmod5<- glm(lindata$Y ~ 0 + lindata$HomeAdv + ., data=lindata, na.action="na.exclude") > set.seed(1) > cv=cv.glm(lindata,linearmod5,K=10) Error in model.frame.default(formula = lindata$Y ~ 0 + lindata$HomeAdv + : variable lengths differ (found for 'air-force-falcons')` — RetaK, Feb 05 '15 at 18:52

BBrill · Accepted Answer · 2015-02-05T21:47:14.717

The error is caused by the way you specify the formula.

This will produce the error:

mod <- glm(mtcars$cyl ~ mtcars$mpg + .,
            data = mtcars, na.action = "na.exclude")

cv.glm(mtcars, mod, K=11) # nrow(mtcars) is 32, so the 11 folds hold about 3 rows each

This will not:

mod <- glm(cyl ~ ., data = mtcars)

cv.glm(mtcars, mod, K=11)

Neither will this:

mod <- glm(cyl ~ mpg + disp, data = mtcars)

cv.glm(mtcars, mod, K=11)

What happens is this: when you specify a variable as mtcars$cyl, it always has as many rows as the original dataset. cv.glm partitions the data frame into K parts and refits the model on each resampled subset. During the refit, variables written as data.frame$var are evaluated at their original (non-partitioned) length, while the others (those supplied by .) have the partitioned length, hence the error.

So you have to refer to variables by their bare column names in the formula (without $).
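A minimal sketch of the point above, using mtcars because lindata is not available here. It assumes the boot package (which provides cv.glm) is installed. The last commented line is only a guess at what the corrected call for the question's data would look like.

```r
library(boot)  # cv.glm lives in boot

# Bare column names are looked up in whatever `data` is supplied,
# so each refit inside cv.glm only sees the training rows.
mod <- glm(cyl ~ mpg + disp, data = mtcars)

set.seed(1)
cv <- cv.glm(mtcars, mod, K = 11)
cv$delta  # raw and bias-adjusted cross-validation estimates

# The `$` form ignores `data`: refitting on a subset by hand
# reproduces the "variable lengths differ" error.
train <- mtcars[-(1:3), ]
res <- try(glm(mtcars$cyl ~ ., data = train), silent = TRUE)
inherits(res, "try-error")

# Assumed equivalent for the question's data (not run here):
# linearmod5 <- glm(Y ~ 0 + ., data = lindata)
```

Since HomeAdv is a column of lindata, the dot would already include it, so it need not be listed separately.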

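Other advice on the formula:

Avoid mixing explicitly named variables with the dot. A predictor written as data.frame$var is also picked up by the dot, so it enters the model twice. The dot stands for all variables in the data frame except those on the left of the tilde.

Why do you add a zero? If the aim is to remove the intercept, note that 0 + and - 1 are equivalent in an R formula, so either works. However, dropping the intercept is bad practice in my opinion.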
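To see the doubling, one can inspect how the dot expands without fitting anything. This is a small sketch using terms() from base R on mtcars. The variable names are illustrative only.

```r
# With bare names, the dot skips variables already in the formula,
# so mpg appears once and the response cyl is left out.
bare <- attr(terms(cyl ~ mpg + ., data = mtcars), "term.labels")
bare

# With `$`, the explicit predictor is not matched against the columns,
# so both "mtcars$mpg" and "mpg" end up as terms.
dollar <- attr(terms(mtcars$cyl ~ mtcars$mpg + ., data = mtcars), "term.labels")
dollar
```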
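As a quick check on the intercept syntax, here is a sketch on mtcars. In an R formula, both 0 + and - 1 remove the intercept, so the two fits below should have identical coefficients.

```r
m_zero  <- glm(cyl ~ 0 + mpg, data = mtcars)
m_minus <- glm(cyl ~ mpg - 1, data = mtcars)

all.equal(coef(m_zero), coef(m_minus))  # same single coefficient for mpg
attr(terms(m_zero), "intercept")        # 0 means no intercept term
```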
