0

I am trying to explore regressions between abundances and 3 variables. My data (test.gam)looks like this:

# A tibble: 6 x 5
  Site   Abundance      SPM isotherm SiOH4
  <chr>      <dbl>    <dbl>    <dbl> <dbl>
1 cycle1     0.769 5960367.    102.   18.2
2 cycle1     0.632 6496360.     97.5  18.2
3 cycle1     0.983 5328652.    105    18.2
4 cycle1     1     6212034.    110    18.2
5 cycle1     0.821 5468987.    105    18.2
6 cycle1     0.734 5280549.    112.   18.2

In one of these variable (SiOH4), I have only one value per Site, while for the 2 other variables, I have single value for each station (each row being a station).

To plot the relation between abundances and SiOH4 I would simply compute a mean value for each Site. The relation show that there is a constant increase of abundances with SiOH4 levels: Plot1.

Now I tried running a GAM on this data using the following code:

mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5)
+ SPM + s(SiOH4, bs = "cr", k = 5), data = test.gam, family = gaussian(link = log), gamma = 1.4)

giving me these results:

Family: gaussian 
Link function: log 

Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr", 
    k = 5)

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.182e-01  8.244e-02  -9.925  < 2e-16 ***
SPM         -4.356e-08  1.153e-08  -3.778 0.000219 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(isotherm) 2.019  2.485 10.407 1.46e-05 ***
s(SiOH4)    3.861  3.986  9.823 1.01e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.492   Deviance explained = 51.2%
GCV = 0.044202  Scale est. = 0.040674  n = 177

So I am happy quite happy about the results but then by checking with gam.check, I find that k is too low.

Method: GCV   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank =  10 / 10 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

              k'  edf k-index p-value    
s(isotherm) 4.00 2.02    0.85    0.01 ** 
s(SiOH4)    4.00 3.86    0.59  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that I voluntarily set my k to 5, otherwise there is an overfit of the pattern.

I thought that this might due to the fact that many of the values in SiOH4 are repeated. By modifying my data to keep only the first value of each Site (replacing all other rows with NA) like:

# A tibble: 6 x 5
# Groups:   Site [1]
  Site   Abundance      SPM isotherm SiOH4
  <chr>      <dbl>    <dbl>    <dbl> <dbl>
1 cycle1     0.769 5960367.    102.   18.2
2 cycle1     0.632 6496360.     97.5  NA  
3 cycle1     0.983 5328652.    105    NA  
4 cycle1     1     6212034.    110    NA  
5 cycle1     0.821 5468987.    105    NA  
6 cycle1     0.734 5280549.    112.   NA  

I hoped preventing this repeated levels. But this way I am also loosing most of my rows, with the na.omit option on. However running the same GAM, I don't have problems with k being too low after using gam.check.

So do I need to keep repetitive values and ignore the warning from gam.check or there is a way to somehow keep all rows even if NA exist?

TylerH
  • 20,799
  • 66
  • 75
  • 101
Rhiz
  • 23
  • 6
  • NA are actually automatically remove when running the GAM. I kind of solved the issue by measuring a mean value to fill up the gap when possible. – Rhiz Jul 04 '18 at 00:56
  • I think you'd be better off treating SiOH4 as (ordered) factor and perhaps having a different smooth per level. Or just replicate the SiOH4 values as required and refit, but I doubt you can do much with a smooth that you can't with an ordered factor when you have only a few unique values. – Gavin Simpson Jul 28 '18 at 11:26

0 Answers0