
Is there a way of automating variable selection for a GAM in R, similar to step()? I've read the documentation for step.gam and gam.selection, but I've yet to see an answer with code that works. Additionally, I've tried method = "REML" and select = TRUE, but neither removes insignificant variables from the model.

I've theorized that I could create a step model and then use those variables to create the GAM, but that does not seem computationally efficient.
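Concretely, that two-stage idea would look something like this sketch (using the example data below; note that step() may well drop every predictor here, since they are all pure noise):

lm_fit <- lm(rsp ~ pred1 + pred2 + pred3 + pred4, data = dat)
keep   <- attr(terms(step(lm_fit, trace = 0)), "term.labels")
# wrap each surviving predictor in s() and refit as a GAM;
# this errors if step() kept nothing
gam_fit <- gam(reformulate(sprintf("s(%s)", keep), response = "rsp"),
               data = dat, method = "REML")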

Example:

library(mgcv)

set.seed(0)
dat <- data.frame(rsp = rnorm(100, 0, 1), 
                  pred1 = rnorm(100, 10, 1), 
                  pred2 = rnorm(100, 0, 1), 
                  pred3 = rnorm(100, 0, 1), 
                  pred4 = rnorm(100, 0, 1))

model <- gam(rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4),
             data = dat, method = "REML", select = TRUE)

summary(model)

#Family: gaussian 
#Link function: identity 

#Formula:
#rsp ~ s(pred1) + s(pred2) + s(pred3) + s(pred4)

#Parametric coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)  0.02267    0.08426   0.269    0.788

#Approximate significance of smooth terms:
#            edf Ref.df     F p-value  
#s(pred1) 0.8770      9 0.212  0.1174  
#s(pred2) 1.8613      9 0.638  0.0374 *
#s(pred3) 0.5439      9 0.133  0.1406  
#s(pred4) 0.4504      9 0.091  0.1775  
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#R-sq.(adj) =  0.0887   Deviance explained = 12.3%
#-REML = 129.06  Scale est. = 0.70996   n = 100
  • In some of the other data sets I use, I have upwards of ten variables (which I realize is not that many in the scheme of statistics), and I'd like to cut down on some variables without a great loss in predictive power. – IJH Jul 25 '16 at 15:06
  • I'm voting to close this question as off-topic because it is not about programming but statistics (model selection). – Ricardo Oliveros-Ramos Oct 25 '17 at 09:57

2 Answers


Marra and Wood (2011, Computational Statistics and Data Analysis 55: 2372–2387) compared various approaches to feature selection in GAMs and concluded that an additional penalty term in the smoothness selection procedure gave the best results. In mgcv::gam() this is activated either with the select = TRUE argument or by using a shrinkage basis such as bs = "ts" (thin plate) or bs = "cs" (cubic). Some variations:

model <- gam(rsp ~ s(pred1,bs="ts") + s(pred2,bs="ts") + s(pred3,bs="ts") + s(pred4,bs="ts"), data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="cr") + s(pred2,bs="cr") + s(pred3,bs="cr") + s(pred4,bs="cr"),
             data = dat, method = "REML",select=T)
model <- gam(rsp ~ s(pred1,bs="cc") + s(pred2,bs="cc") + s(pred3,bs="cc") + s(pred4,bs="cc"),
             data = dat, method = "REML")
model <- gam(rsp ~ s(pred1,bs="tp") + s(pred2,bs="tp") + s(pred3,bs="tp") + s(pred4,bs="tp"), data = dat, method = "REML")
  • What does the addition of `bs = ` do to the model? – IJH Jul 26 '16 at 12:31
  • `bs = ` specifies the basis used for the smooth of each predictor. For instance, `cc` is a cyclic cubic regression spline and `tp` is a thin plate spline. More in the [gam documentation](https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html) – Hugo Fernando Maia Milan Oct 23 '18 at 11:54

In addition to specifying select = TRUE in your call to gam(), you can increase the value of the gamma argument to get stronger penalization. For example, we generate some data:

library("mgcv")
set.seed(2) 
dat <- gamSim(1, n=400, dist="normal", scale=5)
## Gu & Wahba 4 term additive model

We fit a GAM with 'standard' penalization and no extra selection penalty:

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")
summary(b)
##
## Family: gaussian 
## Link function: identity 
##
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.890      0.246   32.07   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df      F  p-value    
## s(x0) 1.363  1.640  0.804   0.3174    
## s(x1) 1.681  2.088 11.309 1.35e-05 ***
## s(x2) 5.931  7.086 16.240  < 2e-16 ***
## s(x3) 1.002  1.004  4.102   0.0435 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## R-sq.(adj) =  0.253   Deviance explained = 27.1%
## -REML = 1212.5  Scale est. = 24.206    n = 400
par(mfrow = c(2, 2)) 
plot(b)

[Figure: plot(b), the four estimated smooths under standard penalization]

We fit a GAM with stronger penalization and variable selection:

b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
          data = dat, method = "REML", select = TRUE, gamma = 7)
summary(b2)

## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x0) + s(x1) + s(x2) + s(x3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8898     0.2604    30.3   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value    
## s(x0) 5.330e-05      9 0.000  0.1868    
## s(x1) 5.427e-01      9 0.967 7.4e-05 ***
## s(x2) 1.549e+00      9 6.210 < 2e-16 ***
## s(x3) 6.155e-05      9 0.000  0.0812 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## R-sq.(adj) =  0.163   Deviance explained = 16.7%
## -REML = 179.46  Scale est. = 27.115    n = 400
plot(b2)

[Figure: plot(b2), the four smooths shrunk towards linear or zero under the stronger penalty]

According to the documentation, increasing gamma produces smoother models, because gamma multiplies the effective degrees of freedom in the GCV or UBRE/AIC criterion.
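To see this in action, a quick sketch: refit the model for a few values of gamma and compare the total effective degrees of freedom, which should fall as gamma grows:

sapply(c(1, 3, 5, 7), function(g) {
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
             data = dat, method = "REML", select = TRUE, gamma = g)
  sum(fit$edf)  # total edf shrinks as the penalty strengthens
})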

A possible downside is thus that all non-linear effects will be shrunk towards linear effects, and all linear effects towards zero. This is what we observe in the plots and output above: with a higher value of gamma, some effects are practically penalized out (edf values close to 0, F-values of 0), while the remaining effects are closer to linear (edf values closer to 1).