
I'm trying to compare backward selection against a full linear regression for dimensionality reduction. The dataset is fairly large, with 150 variables.

I have always used the same approach to compare selected models with cross-validation, but with this dataset cv.glm gives an error that I am having trouble fixing:

Error in model.frame.default(formula = SurveyTest$H.test ~ : variable lengths differ (found for 'Music')

There are no NA values in SurveyTest, and I can't find any other cause for the length difference.

Code for Cross Validation:

#Linear regression full model
lm_full <- lm(SurveyTest$H.test~.,data=SurveyTest)
summary(lm_full)

#Backward selection
library(MASS) # provides stepAIC
lm_init <- lm(H.test~1,data=SurveyTest)
backward_lm <- stepAIC(lm_full,scope = formula(lm_init),direction="backward",
trace = FALSE)
summary(backward_lm)
AIC(backward_lm)

#Cross Validation
library(boot)
model1 <- glm(lm_full)
summary(lm_full)
model2 <- glm(backward_lm)
cv.glm(data=SurveyTest, glmfit=model1,K=10)
cv.glm(data=SurveyTest, glmfit=model2,K=10)
PKumar
  • 10,971
  • 6
  • 37
  • 52
lydias
  • 841
  • 1
  • 14
  • 32

1 Answer


I think I found the solution. lm_full should be created with

lm_full <- lm(H.test~.,data=SurveyTest)

instead of

lm_full <- lm(SurveyTest$H.test~.,data=SurveyTest)

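That solved the problem. With SurveyTest$H.test on the left-hand side, the response is always taken from the full data frame, so when cv.glm refits the model on a subset of the rows, the response is longer than the predictors. That is what "variable lengths differ" is reporting.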
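Here is a minimal sketch that reproduces the problem and the fix on a built-in dataset. It assumes mtcars as a stand-in for SurveyTest and mpg as a stand-in for H.test; the names dat, bad_fit and good_fit are only for illustration, and K = 4 is an arbitrary choice.

```r
library(boot)  # provides cv.glm

# Assumption: mtcars stands in for SurveyTest, mpg for H.test.
dat <- mtcars

# Response written as dat$mpg: it is looked up outside `data`, so it always
# has nrow(dat) values, whichever rows cv.glm passes when refitting on a fold.
bad_fit <- glm(dat$mpg ~ wt + hp, data = dat)
bad_result <- tryCatch(
  cv.glm(data = dat, glmfit = bad_fit, K = 4),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
print(bad_result)

# Response named as a plain column: it is subset together with the predictors.
good_fit <- glm(mpg ~ wt + hp, data = dat)
set.seed(1)  # the fold assignment is random
good_result <- cv.glm(data = dat, glmfit = good_fit, K = 4)
print(good_result$delta)  # raw and bias-adjusted CV estimates of prediction error

# Side check: with `dat$mpg ~ .`, is mpg itself among the expanded predictors?
print("mpg" %in% attr(terms(dat$mpg ~ ., data = dat), "term.labels"))
```

The last line is worth running on the real data as well. If the response column turns up among the predictors, the full model fitted with the $ form was not the intended model in the first place, which is a second reason to use the plain column name.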
