I am trying to use the function cv.glm() from the boot package to cross-validate a linear model. First I fit the model, and it works fine:

linear_model_red<-glm(red_wine_data$quality~.,data=red_wine_data)

Then I want to run the cross-validation:

cv.glm(red_wine_data,linear_model_red)

and it gives me the error:

Error in model.frame.default(formula = red_wine_data$quality ~ ., data = list( : 
  variable lengths differ (found for 'fixed acidity')

I don't have any missing data at all; I checked. All of my variables are also the same length:

sapply(red_wine_data,function(x) length(x))
           fixed acidity             volatile acidity 
                    1599                         1599 
             citric acid               residual sugar 
                    1599                         1599 
               chlorides          free sulfur dioxide 
                    1599                         1599 
    total sulfur dioxide                      density 
                    1599                         1599 
                      pH                    sulphates 
                    1599                         1599 
                 alcohol                      quality 
                    1599                         1599 
volatile acidity*citric acid   volatile acidity*sulphates 
                        1599                         1599 
    volatile acidity*alcohol        citric acid*sulphates 
                        1599                         1599 
         citric acid*alcohol            sulphates*alcohol 
                        1599                         1599 

Please help!

Michael Petch
Rita

1 Answer

Don't use the $ operator inside a formula:

linear_model_red<-glm(red_wine_data$quality~.,data=red_wine_data)

Instead, do this:

linear_model_red<-glm(quality~.,data=red_wine_data)

The reason is that by using $, you're telling R that your model should use a fixed vector of numbers for your response. In this case, that's the quality column in the red_wine_data data frame.

When you fit your initial model, that's okay, because all the other variables also come from that data frame and have the same length. However, when you call cv.glm to do cross-validation, R still uses that same fixed, full-length vector for your response. This no longer works, because cross-validation refits the model on a subset of the rows and then tests it on the held-out rows, so the predictors are shorter than the response; hence the "variable lengths differ" error. By removing the $ (and the red_wine_data on its left), you tell R to look for the quality variable inside the dataset specified by the data argument, which cv.glm sets to each subset in turn. This means the response will match up with the other variables in your model.
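To see the difference without the wine data, here is a minimal sketch that assumes the built-in mtcars data frame as a stand-in, with mpg playing the role of quality. The names bad_fit, good_fit, bad_cv and good_cv are only illustrative. It also shows a side effect of the $ form: because the response is not recognised as a column of the data, the . still expands to every column, so the response ends up among the predictors as well.

```r
library(boot)  # provides cv.glm()

# Stand-in data (an assumption for this sketch): mtcars ships with R,
# and mpg plays the role of quality.

# Using `$` in the formula: the fit on the full data succeeds...
bad_fit <- glm(mtcars$mpg ~ ., data = mtcars)

# ...but `.` expanded to all columns, so mpg is also a predictor.
mpg_is_predictor <- "mpg" %in% attr(terms(bad_fit), "term.labels")

# cv.glm() refits the model on subsets of the rows, while mtcars$mpg
# still has all 32 values, so the refit fails. Capture the message
# instead of stopping the script.
bad_cv <- tryCatch(
  cv.glm(mtcars, bad_fit),
  error = function(e) conditionMessage(e)
)
print(bad_cv)

# Plain column name: the response is looked up in whatever `data`
# cv.glm() passes in, so every refit is consistent.
good_fit <- glm(mpg ~ ., data = mtcars)
good_cv <- cv.glm(mtcars, good_fit)  # default K = n, i.e. leave-one-out
print(good_cv$delta)                 # raw and adjusted CV prediction error
```

With the real data, the same pattern would be glm(quality~.,data=red_wine_data) followed by cv.glm(red_wine_data,linear_model_red); passing a smaller K (for example K=10) avoids refitting the model once per row.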

Hong Ooi