Subset by Chi Sq Values from GAM results in R

Question

I've run a for-loop in R that generates models for a binomial GAM for 200 different random data combinations (200 different set.seed values).

The for-loop and GAMs run just fine, and they store the models in the appropriate list, model[[i]], with each list element representing a model for a certain set.seed iteration.

I can run summary() on an individual list element (model[[5]], for example) and get something like this:

Approximate significance of smooth terms:
                      edf Ref.df Chi.sq  p-value    
s(Petal.Width)  1.133e+00      9  5.414 0.008701 ** 
s(Sepal.Length) 8.643e-01      9  2.338 0.067509 .  

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.361   Deviance explained = 32.7%
-REML = 83.084  Scale est. = 1         n = 160

Since I've got 200 of these elements in my model list, I was wondering if there's a quick way to go through and count how many times (out of the total 200) that Chi.sq value is equal to 0 for the Petal.Width variable. Basically, since I have the GAMs set up with bs = "cs", the number of times that Chi.sq is equal to 0 represents how often the variable-selection process removed that variable from the model.

Here's a cleaned-up version of the code I used for the for-loop to build the model:


a <- c(1:200)
model <- list()


for (i in a) {
  
#In my version there's some extra code here that subsets iris based on i 
  
  model[[i]] <- gam(Species~ s(Petal.Width, bs="cs") + 
                      s(Sepal.Length, bs="cs"),
                    data = iris,
                    family = binomial,
                    method = "REML")

}

could you include the code for your model? I've used the `tidy()` command from the `broom` package which puts model results in a tibble to collect output then you can put together columns. — brian avery, Jan 27 '21 at 21:27
@brianavery Made some edits - is that what you were looking for? — pfadenhw, Jan 27 '21 at 21:44
helpful, thanks, but which package are you using for `gam` -- `gam` or `mgcv`? — brian avery, Jan 27 '21 at 21:47

score 0 · Accepted Answer · answered Jan 27 '21 at 21:58

I've found the tidy() function from the broom package to be helpful in this type of situation.
here's your model (I cut down to number of iterations to 10 to make it go faster,
also it was throwing warnings using mgcv::gam())

a <- c(1:10)
model <- list()

for (i in a) {
  model[[i]] <- gam(Species~ s(Petal.Width, bs="cs") + 
                      s(Sepal.Length, bs="cs"),
                    data = iris,
                    family = binomial,
                    method = "REML")
}

If you needed or wanted the list of all the models, then this second loop will extract the statistic from each model and combine them into a dataframe:

results_df <- data.frame(term=c("Petal.Width","Sepal.Length"))
for(m in model){
  results_df <- bind_cols(results_df, model=tidy(m)$statistic)
}
results_df

with output:

          term    model...2    model...3    model...4    model...5    model...6
1  Petal.Width 1.304372e-09 1.304372e-09 1.304372e-09 1.304372e-09 1.304372e-09
2 Sepal.Length 3.844738e-16 3.844738e-16 3.844738e-16 3.844738e-16 3.844738e-16
     model...7    model...8    model...9   model...10   model...11
1 1.304372e-09 1.304372e-09 1.304372e-09 1.304372e-09 1.304372e-09
2 3.844738e-16 3.844738e-16 3.844738e-16 3.844738e-16 3.844738e-16

you can rename the columns to make them nicer, etc.
also you could combine the modeling and the combining of the statistics if that's all you want to keep.

Whoops - I had reduced the number of variables when I simplified the code for this question so disregard the previous comment (now deleted). — pfadenhw, Jan 28 '21 at 14:26
However, I now have the `results_df` dataframe, but the statistics in there don't match the Chi.sq values that I get if I manually pull up `summary(model[[1]])` — pfadenhw, Jan 28 '21 at 14:27

score 0 · Answer 2 · answered Jan 28 '21 at 15:47

Using info from @brian avery and this question I've come up with the following that exactly answers my original question. I am sure there is a more efficient way to do some of this with pipes, but this works for me.

a <- c(1:200)
model <- list()
sum <- list ()
Chisq <- list()


for (i in a) {
  model[[i]] <- gam(Species~ s(Petal.Width, bs="cs") + 
                      s(Sepal.Length, bs="cs"),
                    data = iris,
                    family = binomial,
                    method = "REML")

  sum[[i]] <- summary(model[[i]])
  Chisq[[i]] <- sum[[i]]$chi.sq

}


results_df <- data.frame(term=c("Petal.Width","Sepal.Length"))

for(i in a){
  results_df <- bind_cols(results_df, Chisq[[i]])
}

rowSums(results_df<0.001)

The last line counts how many times the variables ended up with a Chi.sq value below 0.001. It answers the question of how many times (out of 200) the mgcv package automatically shrunk that variable out of the model.

Subset by Chi Sq Values from GAM results in R

2 Answers2