Selecting the statistically significant variables in an R glm model

Question

I have an outcome variable, say Y, and a list of 100 candidate predictors that could affect Y (say X1...X100).

After running my glm and viewing a summary of my model, I see those variables that are statistically significant. I would like to be able to select those variables and run another model and compare performance. Is there a way I can parse the model summary and select only the ones that are significant?

Try the [glmulti](http://www.jstatsoft.org/v34/i12/paper) package. — krlmlr, Apr 22 '13 at 18:24
In addition, you must be warned against selecting "significant" variables in this fashion. Statistical significance can be changed with addition/removal of a single independent variable. Your question suggests the removal of *all* variables insignificant on the first run. In doing that, some of the initially significant variables will become insignificant, whereas some of the variables you have removed may have had good predictive value. What you really want is removal one by one, and stepwise comparison of model fit. See this thread: http://bit.ly/ZLVaD5 — Maxim.K, Apr 23 '13 at 07:40
See also this: http://www.statmethods.net/stats/regression.html — Maxim.K, Apr 23 '13 at 07:42
@Maxim.K Stepwise regression is frowned upon over at CrossValidated. As I said in chat, I might approach this problem with the lasso. Anyway, that's off-topic here. — Roland, Apr 23 '13 at 08:08
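For readers who want to see what the one-by-one removal described in the comments looks like, here is a minimal sketch using `step()` from base R, which drops terms by AIC rather than by p-value. The simulated data, the seed, and the names `full.model` and `reduced.model` are assumptions made for illustration. As noted in the comments, stepwise selection is contested, so treat this as a demonstration of the mechanics rather than a recommendation.

```r
set.seed(123)                          # assumed seed, only for reproducibility
n <- 200
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
mydata$y <- 2 * mydata$x4 + rnorm(n)   # simulated: only x4 truly affects y

full.model <- glm(y ~ x1 + x2 + x3 + x4, data = mydata)

# Backward elimination: at each step, drop the term whose removal lowers AIC most
reduced.model <- step(full.model, direction = "backward", trace = 0)

formula(reduced.model)                 # terms that survived the elimination
```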

Maxim.K · Answer 1 · 2015-08-30T05:25:25.233

Although @kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:

x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
x4 <- y + 5 # this will make a nice significant variable to test our code
(mydata <- as.data.frame(cbind(x1,x2,x3,x4,y)))

Our model is then:

model <- glm(formula=y~x1+x2+x3+x4,data=mydata)

And the Boolean vector of the coefficients can indeed be extracted by:

toselect.x <- summary(model)$coeff[-1,4] < 0.05 # credit to kith

But this is not all! In addition, we can do this:

# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE] 
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",relevant.x))

EDIT: as subsequent posters have pointed out, the latter line should be sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+"))) to include all variables.

And run the regression with only significant variables as OP originally wanted:

sig.model <- glm(formula=sig.formula,data=mydata)

In this case the estimated coefficient for x4 will be 1, because we defined x4 as y + 5, which makes the relationship between x4 and y perfectly linear.
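Since the original question also asks about comparing performance, here is one possible way to put the full and reduced models side by side. This is a sketch on separately simulated data with noise (the toy data above fit perfectly, which makes comparisons uninformative). The seed, the sample size, and the names `full.model`, `aic.table` and `cmp` are assumptions for illustration.

```r
set.seed(42)                           # assumed seed, only for reproducibility
n <- 200
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
mydata$y <- 2 * mydata$x4 + rnorm(n)   # simulated: only x4 truly affects y

full.model <- glm(y ~ x1 + x2 + x3 + x4, data = mydata)

# Same selection logic as in the answer: p-values without the intercept
pvals <- coef(summary(full.model))[-1, 4]
keep  <- names(pvals)[pvals < 0.05]
sig.formula <- as.formula(paste("y ~", paste(keep, collapse = " + ")))
sig.model   <- glm(sig.formula, data = mydata)

# Information criterion for both models (lower AIC is better)
aic.table <- AIC(full.model, sig.model)
print(aic.table)

# F test for the nested models, reduced model first
cmp <- anova(sig.model, full.model, test = "F")
print(cmp)
```

Both comparisons are computed on the same data that was used to choose the variables, so they tend to flatter the reduced model; a held-out set would give a more honest picture.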

This was great, thanks! But I had to change the sig.formula a little for it to work for me: sig.formula <- as.formula(paste(" y ~", paste(relevant.x, collapse=" + "))) Without the collapse it only took the first variable name from relevant.x — ElinaJ, Aug 29 '15 at 14:16
Indeed, other posters have noted this. I've included the improvement in the answer for clarity. — Maxim.K, Aug 30 '15 at 05:26
When I do this it does not work for variables that get turned into factors. Is there a way around this? — Alberto MQ, Nov 21 '19 at 20:47
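Regarding factors: the coefficient names for a factor are level-specific (for example `groupb`, `groupc`), so they no longer match the variable name that the formula needs. One possible workaround, sketched below, maps each coefficient back to its model term through the `assign` attribute of the design matrix. The data, the seed, and the variable `group` are made up for illustration, and the sketch assumes no coefficients were dropped as aliased (NA).

```r
set.seed(1)                            # assumed seed, only for reproducibility
n <- 300
mydata <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x1    = rnorm(n)
)
mydata$y <- 1.5 * (mydata$group == "c") + rnorm(n)   # simulated effect of level "c"

model <- glm(y ~ group + x1, data = mydata)

pvals <- coef(summary(model))[, 4]     # p-values, intercept included

# "assign" gives, for each design-matrix column, the index of its term (0 = intercept)
term.index  <- attr(model.matrix(model), "assign")
term.labels <- attr(terms(model), "term.labels")

# Keep a term if at least one of its coefficients is significant
sig.coef    <- pvals < 0.05 & term.index > 0
relevant.x  <- unique(term.labels[term.index[sig.coef]])

sig.formula <- as.formula(paste("y ~", paste(relevant.x, collapse = " + ")))
```

Keeping a factor whenever any single level is significant is a judgment call; a term-level test such as `drop1(model, test = "F")` may be a better fit for some analyses.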

score 7 · Accepted Answer · answered Apr 22 '13 at 18:24

You can access the p-values of the glm result through the "summary" function. The last column of the coefficients matrix is called "Pr(>|t|)" and holds the p-values of the predictors used in the model.

Here's an example:

#x is a 10 x 3 matrix
x = matrix(rnorm(3*10), ncol=3)
y = rnorm(10)
res = glm(y~x)
#ignore the intercept pval
summary(res)$coeff[-1,4] < 0.05
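One detail worth knowing: the column is named "Pr(>|t|)" for the default gaussian family, but families with a fixed dispersion such as binomial or poisson report "Pr(>|z|)" instead. The following sketch, on made-up data, picks the p-value column by pattern so the same code works in both cases; the seed and the names `coefs` and `p.col` are assumptions for illustration.

```r
set.seed(7)                            # assumed seed, only for reproducibility
x <- matrix(rnorm(3 * 50), ncol = 3)   # 50 x 3 matrix of predictors
y <- rbinom(50, size = 1, prob = 0.5)  # simulated binary outcome

res   <- glm(y ~ x, family = binomial)
coefs <- coef(summary(res))            # same matrix as summary(res)$coefficients

colnames(coefs)                        # last column is "Pr(>|z|)" for this family

# Locate the p-value column by name instead of relying on a fixed label
p.col <- grep("^Pr\\(", colnames(coefs))
pvals <- coefs[-1, p.col]              # drop the intercept row
pvals < 0.05
```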

score 2 · Answer 3 · answered May 23 '15 at 14:19

For people having issues with Maxim.K's command

sig.formula <- as.formula(paste("y ~",relevant.x))

use this

sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))

The final code will look like this:

toselect.x <- summary(glmText)$coeff[-1,4] < 0.05 # credit to kith
# select sig. variables
relevant.x <- names(toselect.x)[toselect.x == TRUE] 
# formula with only sig variables
sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))

This fixes the bug where only the first variable is picked up.

score 1 · Answer 4 · answered May 03 '13 at 17:46

In the line

sig.formula <- as.formula(paste("y ~",relevant.x))

only the first variable of relevant.x ends up in the formula; the others are ignored (try, for example, changing the condition to > 0.5 so that more variables are selected).
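To see why this happens, note that `paste()` is vectorized: given several variable names, it returns one string per name rather than a single formula string. The short sketch below shows the difference, plus `reformulate()` from base R as an alternative way to build the same formula. The contents of `relevant.x` here are hypothetical.

```r
relevant.x <- c("x1", "x4")            # hypothetical result of the selection step

# Without collapse: one string per variable, i.e. "y ~ x1" and "y ~ x4"
separate <- paste("y ~", relevant.x)
length(separate)

# With collapse: the names are first joined into a single right-hand side
rhs         <- paste(relevant.x, collapse = " + ")
sig.formula <- as.formula(paste("y ~", rhs))

# Base R alternative that builds the formula from the term labels directly
alt.formula <- reformulate(relevant.x, response = "y")
```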

user2347888
