3

I am now using the package mgcv to build a GAMM in R, and my questions are:

  • First, how can we know if the random effect is statistically significant or not?
  • Second, how can we extract the random intercept values in the model?
  • Third, what does the "offset" mean in gamm? I have checked the R help, but I am still confused about the "offset" term in the function? Thanks for any help.

The example is taken from the book Generalized additive models: an introduction with R

library(mgcv)
library(gamair)

data(sole)
sole$off <- log(sole$a.1-sole$a.0)
sole$a<-(sole$a.1+sole$a.0)/2 
solr<-sole
solr$t<-solr$t-mean(sole$t)
solr$t<-solr$t/var(sole$t)^0.5
solr$la<-solr$la-mean(sole$la)
solr$lo<-solr$lo-mean(sole$lo)

solr$station <- factor(with(solr,paste(-la,-lo,-t,sep="")))  
som <- gamm(eggs~te(lo,la,t,bs=c("tp","tp"),k=c(25,5),d=c(2,1))
        +s(t,k=5,by=a)+offset(off), family=quasipoisson,
        data=solr,random=list(station=~1))
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Yang Yang
  • 858
  • 3
  • 26
  • 49
  • `som` and `str(som)` might show you some of the information it contains – Henry Jan 17 '19 at 23:35
  • @Henry Yes, `som` and `str(som)` can show some information, but they do not show the statistical significance of the random effect. – Yang Yang Jan 18 '19 at 00:13
  • start with http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#testing-significance-of-random-effects ; do a likelihood ratio test between the full and reduced models. – Ben Bolker Jan 18 '19 at 00:23
  • @BenBolker Hi, thank you for your help. Could you please formally answer the question so I can accept your answer? Meanwhile, others can also benefit. Thanks – Yang Yang Jan 19 '19 at 02:40

1 Answers1

3

Note that for this model it might make more sense to use a Tweedie response via family tw with gam() and bam(), which can't be used with gamm(). In fact, Simon Wood and Matteo Fasiolo fit these data with a location scale Tweedie GAM (wherein they model the mean, variance and power parameters of the Tweedie distribution each with a separate linear predictor [model]).

At @BenBolker's suggestion: I wouldn't even bother testing the random effect in this model specifically, and often I don't care if it is significant or not. It depends on the question or hypothesis I am working on. Often I want it in the model due to some clustering in the data that I want included in the model regardless of the significance.

However, I'm not convinced that the theory of the (Generalized) likelihood ratio test (GLRT) doesn't apply to the use of quasi-likelihood in this instance. Simon Wood presents derivations in Appendix A of the 2nd edition of his textbook on GAMS that show that the previously-derived results for maximum likelihood estimation (which include results for the GLRT) hold if we replace the log likelihood with the log quasi likelihood. This, at least Simon seems to be arguing, would suggest that the interpretation of the test I mention below and which is implemented in summary.gam() for random effects, is as reliable as if it were based on a proper likelihood.

Unless I really needed to, I'd fit this model with gam() or bam() and then gamm4() (the latter from the gamm4 package), before gamm(), especially for non-Gaussian models, as the gamm() function has to fit this model as a mixed effects model using penalised quasi likelihood, which is not necessarily the best way of estimating these models.

library(mgcv)
library(gamair)
devtools::install_github('gavinsimpson/gratia')
library(gratia)

data(sole)
sole$off <- log(sole$a.1-sole$a.0)
sole$a<-(sole$a.1+sole$a.0)/2 
solr <- sole
solr$t <- solr$t-mean(sole$t)
solr$t <- solr$t/var(sole$t)^0.5
solr$la <- solr$la-mean(sole$la)

solr$lo <- solr$lo-mean(sole$lo)
solr$station <- factor(with(solr,paste(-la,-lo,-t,sep="")))

som <- gam(eggs ~ te(lo, la, t, bs = c('tp','tp'), k = c(25, 5), d = c(2,1)) + 
           s(t, k = 5, by = a) + s(station, bs = 're') + offset(off),
           family = quasipoisson, data = solr, method = 'REML')

Then summary(som) gives a test based on a likelihood ratio test as suggested by @BenBolker, but the reference distribution is corrected for testing on the boundary of the parameter space.

> summary(som)

Family: quasipoisson 
Link function: log 

Formula:
eggs ~ te(lo, la, t, bs = c("tp", "tp"), k = c(25, 5), d = c(2, 
    1)) + s(t, k = 5, by = a) + s(station, bs = "re") + offset(off)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4016     0.3061  -11.11   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                edf  Ref.df      F  p-value    
te(lo,la,t)  56.025  65.456  2.547 4.62e-10 ***
s(t):a        4.535   4.886 54.790  < 2e-16 ***
s(station)  128.563 388.000  1.175  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.833   Deviance explained =   88%
-REML = -7.9014  Scale est. = 0.58148   n = 1575

I was having trouble getting the model without the random effect to converge using gamm() so I was unable to test the random effect term and even encountered an error when trying the multi-model form of anova().

If you want to get the random effects, using the gam() model, you can use my gratia package (hopefully on CRAN in a few days but which can be installed from github as shown above), and then:

> evaluate_smooth(som, 's(station)')
# A tibble: 394 x 5
   smooth    by_variable station                                       est    se
   <chr>     <fct>       <chr>                                       <dbl> <dbl>
 1 s(statio… NA          -0.0004304761904734280.419685714285714-… -0.0396   2.55
 2 s(statio… NA          -0.0004304761904734280.6586857142857140…  1.48     1.20
 3 s(statio… NA          -0.0004304761904734281.15968571428571-1… -0.00606  2.63
 4 s(statio… NA          -0.0004304761904734281.176685714285710.… -0.0767   2.48
 5 s(statio… NA          -0.002430476190475870.9096857142857141.… -0.00654  2.63
 6 s(statio… NA          -0.01243047619047390.4106857142857140.0… -0.802    1.61
 7 s(statio… NA          -0.0154304761904740.631685714285714-0.4… -0.138    2.35
 8 s(statio… NA          -0.02043047619047660.375685714285714-0.… -0.426    1.94
 9 s(statio… NA          -0.02543047619047911.14668571428571-0.4… -0.0333   2.57
10 s(statio… NA          -0.02743047619047450.875685714285714-0.… -0.0673   2.49
# … with 384 more rows

and you want the est column.

An offset is a term in the model that has a fixed effect of 1. In this instance it is being used to standardise the count response so that each you are comparing like for like; it is being used in this instance to integrate over the ages of eggs found in this sample. Read p. 143 of the 2nd ed of Simon's GAM book to see more about what is being done for this model and what the offset means.

More generally, say you sample a river with two nets; one net has twice the area than the other. You are more likely to capture more things in the larger net and hence the counts from the larger net will be higher on account of the greater sampling effort — you swept more of the river with the bigger net (assuming you sampled for the same amount of time). To make sure that you account for this difference in effort, you can include an offset in the model. The offset will be (for a Poisson model with log link) offset(log(net_area)). We have to include the offset on the link scale, hence the log(). Now what we are modelling is the count per unit area of net.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • Hi, thank you for your help. I wonder what `est` means? How can I address the significance of the random effect using your package? Thanks. – Yang Yang Jan 21 '19 at 04:31
  • The `est` column is the estimated effects for each station, the `se` column is the standard error of that effect. I already showed how to test the random effect - look at the output from `summary()` for the line `s(station)`. If you are talking about individual effects from the `est` column, I don't know any other way for GAM(M)s than to simply form a 95% confidence interval on `est` (add and subtract 2 times `se` from `est`), as bootstrapping (non parametric) is not advised for smoothers. – Gavin Simpson Jan 21 '19 at 15:23
  • @GavinSimpson, are you at all worried about doing a likelihood ratio test for a quasi-likelihood model? It's underdispersed, so it's going to be hard to convert to a standard likelihood model (e.g. negative binomial is not an option; we need generalized gamma or COM-Poisson or something else exotic ...) I would probably work a little harder to convince the OP that maybe they don't need a significance test of the random effect ... – Ben Bolker Jan 22 '19 at 15:51
  • 1
    @BenBolker Yes; I'd prefer this was modelled some other way - my comment in the last sentence got pushed out the way as I answered other parts of this. Simon Wood and Matteo Fasiolo actually model these data with a Tweedie distribution in more recent work (including via a location scale Tweedie, but as yet this isn't in mgcv as it currently only works with their extended Fellner-Schall method for smoothness selection, which is not default). My personal preference would be to just include the random effect and not bother testing it. – Gavin Simpson Jan 22 '19 at 17:12
  • @BenBolker I wanted to take a look at what Simon says about this and he's less reticent in the use of a GLRT that uses the log quasi likelihood. I've added something in the answer which I hope addresses some of these points(?) – Gavin Simpson Jan 22 '19 at 17:29
  • OK. Does he suggest a test based on chi-squared or F? (`?anova.glm and [Venables and Ripley](https://books.google.ca/books?id=CzwmBQAAQBAJ&pg=PA208&dq=quasipoisson+F+test+venables+ripley&hl=en&sa=X&ved=0ahUKEwj15Y6ckoLgAhUINd8KHRT8A7wQ6AEIMTAB#v=onepage&q=quasipoisson%20F%20test%20venables%20ripley&f=false) suggest F tests in this case; VR call this an "ad hoc" adjustment ... – Ben Bolker Jan 22 '19 at 19:42
  • @BenBolker Simon uses a chi-square, but notes that the derivation he uses is different to that which would yield a Wald test. But beyond that I'd need to look into this and V&R. – Gavin Simpson Jan 22 '19 at 21:07