8

Let's imagine we want to model the US State Public-School Expenditures (education) using income, young, urban and region as regressors. For more info: ?Anscombe Model: education ~ (income+young+urban)*region

library(car)
library(leaps)

#Loading Data
data(Anscombe)
data(state)
stateinfo <- data.frame(region=state.region,row.names=state.abb)
datamodel <- data.frame(merge(stateinfo,Anscombe,by="row.names"),row.names=1)
head(datamodel)
   region education income young urban
AK   West       372   4146 439.7   484
AL  South       112   2337 362.2   584
AR  South       134   2322 351.9   500
AZ   West       207   3027 387.5   796
CA   West       273   3968 348.4   909
CO   West       192   3340 358.1   785

#Saturated Model
MOD1 <- lm(education~(.-region)*region,datamodel)
summary(MOD1)
#anova(MOD1)

#Look for the "best" model
MOD1.subset <- regsubsets(education~(.-region)*region,datamodel,nvmax=15)
plot(MOD1.subset) 

The model with 3 variables and 1 interaction (education~income+young+urban+RegionWest:young) seems to be the best in terms of BIC.

coef(MOD1.subset,4)

The questions is, how do I obtain the ML object from that model without writing the formula by hand?

Before posting, I found the package HH which has some interesting functions for regsubsets objects such as summaryHH and lm.regsubsets.

library(HH)
summaryHH(MOD1.subset)[4,]
lm.regsubsets(MOD1.subset,4)

lm.regsubsets does what I want but I think that has some problems parsing interactions, any alternatives?

Emer
  • 3,734
  • 2
  • 33
  • 47
  • I really scratch my head when thinking about that "best model". How on earth are you going to interprete a model where the interaction consists of a binary variable and only one level of a categorical variable? I have the uncomfortable feeling you don't completely grasp the reasoning behind the package... – Joris Meys Oct 25 '12 at 16:28

3 Answers3

3

I don't think this is going to be possible, at least not without a lot of processing of the names of the coefficients. I got ~95% of the way there but fell down on the interaction term:

coefs <- coef(MOD1s, 4)
nams <- names(coefs)
nams <- nams[!nams %in% "(Intercept)"]
response <- as.character(as.formula(MOD1s$call[[2]])[[2]])
form <- as.formula(paste(response, paste(nams, collapse = " + "), sep = " ~ "))
df <- get(as.character(MOD1s$call[[3]]))
mod <- lm(form, data = df)

> mod <- lm(form, data = df)
Error in eval(expr, envir, enclos) : object 'regionWest' not found

Which makes sense and arise from the names used for the coefficients:

> nams
[1] "income"           "young"            "urban"           
[4] "regionWest:young"

It is quite likely that with some effort you could do:

  1. split any names with : on :,
  2. for each side, see if there is a partial match with the variables in the data frame df
  3. if there is a match, check that the bit not matched matches a level of the variable in df after coercing to a factor if required,
  4. if there is a match with a level, replace that side of the interaction with the variable name,
  5. if there isn't a match, see if there are any other partial matched, fail if there aren't

and so on. That is quite a lot of programming involved for a [so] posting, but if you are up to the challenge then the above should give you something to start from.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
0

Sorry to bring this question back up, but I was looking for an answer to this myself.

The specific criterion used (e.g. AIC, BIC) does not affect the results of regsubsets since the function only compares against models of the same size and AIC differs from BIC only by the "penalty" assigned to model size. However, if you're interested in comparing models of different sizes, one may prefer to use AIC instead of BIC.

I don't think that regsubsets has the ability to plot AIC. However, AIC can easily be calculated using:

aic <-summary(leaps2)$bic + (2 - log(n))*(p+1)

where n is the number of samples and p is the number of parameters in the model (see the last page of http://stat.wharton.upenn.edu/~khyuns/stat431/ModelSelection.pdf for the definitions of aic and bic).

I tried tricking regsubsets into plotting the new aic value but was unsuccessful. However, you can easily compare models of different sizes by just looking at the matrix of aic values, and perhaps ordering them using 'order(aic)'

Steve
  • 135
  • 1
  • 6
0

I figured it out. So here's the code.

fit <- regsubsets(y~x,data=train)
b <- data[c(predictor columns)]
best <- order(summary(fit)$adjr2,decreasing=T)[1]
a <- as.integer(summary(fit)$which[best,][1:ncol(b)+1]
new_data <- data.frame(t( t(b) * a))
fit_lm <- lm(y~x,data=new_data)

You multiply the columns by the bool values. If you're input is always 0, it won't do anything for the model. No variance explained. You could loop through this if you wanted to by replacing the last index in the "best" variable with an "i" from a for loop.

WARNING: Your columns have to line up for your train/cross val/test data. If the first index of your training set is Gender and your first index of your cross validation set is Age, you're going to zero out the wrong column.

Side note: You can see that I picked the best model with respect to adjusted R^2 value. Feel free to change that.

Hope I helped. Cheers.

Jon Joh
  • 1
  • 1