
Is there a way of automating variable selection for a GAM in R, similar to step()? I've read the documentation for step.gam and gam.selection, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.

I've theorized that I could create a step model and then use those variables to create the GAM, but that does not seem computationally efficient.
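Concretely, that two-stage idea would look something like this sketch (using the example data below; note that step() may well drop every predictor here, since they are all pure noise):

lm_fit <- lm(rsp ~ pred1 + pred2 + pred3 + pred4, data = dat)
keep   <- attr(terms(step(lm_fit, trace = 0)), "term.labels")
# wrap each surviving predictor in s() and refit as a GAM;
# this errors if step() kept nothing
gam_fit <- gam(reformulate(sprintf("s(%s)", keep), response = "rsp"),
               data = dat, method = "REML")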

Example:

library(mgcv)

set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1), 
                  pred1 = rnorm(100, 10, 1), 
                  pred2 = rnorm(100, 0, 1), 
                  pred3 = rnorm(100, 0, 1), 
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

summary(model)

#Family: gaussian 
#Link function: identity 

#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)

#Parametric coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  0.02267    0.08426   0.269    0.788

#Approximate significance of smooth terms:
#            edf Ref.df     F p-value  
#s(pred1) 0.8770      9 0.212  0.1174  
#s(pred2) 1.8613      9 0.638  0.0374 *
#s(pred3) 0.5439      9 0.133  0.1406  
#s(pred4) 0.4504      9 0.091  0.1775  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#R-sq.(adj) =  0.0887   Deviance explained = 12.3%
#-REML = 129.06  Scale est. = 0.70996   n = 100
  • In some of the other data sets I use, I have upwards of ten variables (which I realize is not that many in the scheme of statistics), and I'd like to cut down on some variables without a great loss in predictive power. – IJH Jul 25 '16 at 15:06
  • I'm voting to close this question as off-topic because it is not about programming but statistics (model selection). – Ricardo Oliveros-Ramos Oct 25 '17 at 09:57

2 Answers


Marra and Wood (2011, Computational Statistics and Data Analysis 55: 2372–2387) compared various approaches to feature selection in GAMs and concluded that an additional penalty term in the smoothness selection procedure gave the best results. In mgcv::gam() this is activated either with the select = TRUE argument or by using a shrinkage basis such as bs = "ts" (thin plate) or bs = "cs" (cubic). Some variations:

model <- gam(rsp ~ s(pred1,bs="ts") + s(pred2,bs="ts") + s(pred3,bs="ts") + s(pred4,bs="ts"), data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="cr") + s(pred2,bs="cr") + s(pred3,bs="cr") + s(pred4,bs="cr"),
             data = dat, method = "REML",select=T)
model <- gam(rsp ~ s(pred1,bs="cc") + s(pred2,bs="cc") + s(pred3,bs="cc") + s(pred4,bs="cc"),
             data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="tp") + s(pred2,bs="tp") + s(pred3,bs="tp") + s(pred4,bs="tp"), data = dat, method = "REML")
  • What does the addition of `bs = ` do to the model? – IJH Jul 26 '16 at 12:31
  • `bs = ` specifies the basis used for the smooth of each predictor. For instance, `cc` is a cyclic cubic regression spline and `tp` is a thin plate spline. More in the [gam documentation](https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html) – Hugo Fernando Maia Milan Oct 23 '18 at 11:54

In addition to specifying select = TRUE in your call to gam(), you can increase the value of the gamma argument to get stronger penalization. For example, we generate some data:

library("mgcv")
set.seed(2) 
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model

We fit a GAM with 'standard' penalization and no extra selection penalty:

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")
summary(b)
##
## Family: gaussian 
## Link function: identity 
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.890      0.246   32.07   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df      F  p-value    
## s(x0) 1.363  1.640  0.804   0.3174    
## s(x1) 1.681  2.088 11.309 1.35e-05 ***
## s(x2) 5.931  7.086 16.240  < 2e-16 ***
## s(x3) 1.002  1.004  4.102   0.0435 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) =  0.253   Deviance explained = 27.1%
## -REML = 1212.5  Scale est. = 24.206    n = 400
par(mfrow = c(2, 2)) 
plot(b)

[Figure: plot(b), the four estimated smooths under standard penalization]

We fit a GAM with stronger penalization and variable selection:

b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
          data = dat, method = "REML", select = TRUE, gamma = 7)
summary(b2)

## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8898     0.2604    30.3   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(x0) 5.330e-05      9 0.000  0.1868    
## s(x1) 5.427e-01      9 0.967 7.4e-05 ***
## s(x2) 1.549e+00      9 6.210 < 2e-16 ***
## s(x3) 6.155e-05      9 0.000  0.0812 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## R-sq.(adj) =  0.163   Deviance explained = 16.7%
## -REML = 179.46  Scale est. = 27.115    n = 400
plot(b2)

[Figure: plot(b2), the four smooths shrunk towards linear or zero under the stronger penalty]

According to the documentation, increasing gamma produces smoother models, because gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion.
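To see this in action, a quick sketch: refit the model for a few values of gamma and compare the total effective degrees of freedom, which should fall as gamma grows:

sapply(c(1, 3, 5, 7), function(g) {
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
             data = dat, method = "REML", select = TRUE, gamma = g)
  sum(fit$edf)  # total edf shrinks as the penalty strengthens
})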

A possible downside is thus that all non-linear effects will be shrunk towards linear effects, and all linear effects towards zero. This is what we observe in the plots and output above: with a higher value of gamma, some effects are practically penalized out (edf values close to 0, F-values of 0), while the remaining effects are closer to linear (edf values closer to 1).