I am trying to explore regressions between abundances and 3 variables. My data (test.gam)looks like this:
# A tibble: 6 x 5
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 18.2
3 cycle1 0.983 5328652. 105 18.2
4 cycle1 1 6212034. 110 18.2
5 cycle1 0.821 5468987. 105 18.2
6 cycle1 0.734 5280549. 112. 18.2
In one of these variable (SiOH4), I have only one value per Site, while for the 2 other variables, I have single value for each station (each row being a station).
To plot the relation between abundances and SiOH4 I would simply compute a mean value for each Site. The relation show that there is a constant increase of abundances with SiOH4 levels: Plot1.
Now I tried running a GAM on this data using the following code:
mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5)
+ SPM + s(SiOH4, bs = "cr", k = 5), data = test.gam, family = gaussian(link = log), gamma = 1.4)
giving me these results:
Family: gaussian
Link function: log
Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr",
k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.182e-01 8.244e-02 -9.925 < 2e-16 ***
SPM -4.356e-08 1.153e-08 -3.778 0.000219 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(isotherm) 2.019 2.485 10.407 1.46e-05 ***
s(SiOH4) 3.861 3.986 9.823 1.01e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.492 Deviance explained = 51.2%
GCV = 0.044202 Scale est. = 0.040674 n = 177
So I am happy quite happy about the results but then by checking with gam.check
, I find that k is too low.
Method: GCV Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(isotherm) 4.00 2.02 0.85 0.01 **
s(SiOH4) 4.00 3.86 0.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that I voluntarily set my k to 5, otherwise there is an overfit of the pattern.
I thought that this might due to the fact that many of the values in SiOH4
are repeated. By modifying my data to keep only the first value of each Site (replacing all other rows with NA) like:
# A tibble: 6 x 5
# Groups: Site [1]
Site Abundance SPM isotherm SiOH4
<chr> <dbl> <dbl> <dbl> <dbl>
1 cycle1 0.769 5960367. 102. 18.2
2 cycle1 0.632 6496360. 97.5 NA
3 cycle1 0.983 5328652. 105 NA
4 cycle1 1 6212034. 110 NA
5 cycle1 0.821 5468987. 105 NA
6 cycle1 0.734 5280549. 112. NA
I hoped preventing this repeated levels. But this way I am also loosing most of my rows, with the na.omit
option on. However running the same GAM, I don't have problems with k being too low after using gam.check
.
So do I need to keep repetitive values and ignore the warning from gam.check
or there is a way to somehow keep all rows even if NA exist?