1

When I conducted data analysis, I encountered an serious problem.

I want to research how a Treatment intensity would influence response variable y. The field investigation was conduced at different SITES on two years in long time ago. According to my knowledge, it’s not so good to group_by SITES and YEAR to acquire mean or sum value of y, especially excluding sum due to several observations were lost at different SITES.

Then I tried to build a negative binomial model with glmmTMB and check with DHARMa package: I got not so good residuals pattern.

library(glmmTMB)
model <- glmmTMB::glmmTMB(y~intensity+YEAR+(1|SITES),data = f,family = 'nbinom2')

res = simulateResiduals(model)
res %>% plot() 

And I tried to check whether it violates assumption of spatial auto-correlation with DHARMa too. It shows significant violation to the assumption( p-value = 0.0083). So, I tried to resolve the problem.

res2 <- recalculateResiduals(res, group = f$SITES)
testSpatialAutocorrelation(res2, 
                           x = f %>% group_by(SITES) %>% summarise(lat=mean(lat)) %>%  pull(lat),
                           y = f %>% group_by(SITES) %>% summarise(long=mean(long)) %>%  pull(long)
)



    DHARMa Moran's I test for distance-based autocorrelation

data:  res2
observed = 0.1135, expected = -0.0714, sd = 0.0701, p-value = 0.0083
alternative hypothesis: Distance-based autocorrelation

Because a model creating through the glmmTMB allowed to add coordinates to variance structure. So, I tried to add coordinates to above model with 3 methods as below:

  • mat()
  • gau()
  • exp()
temp$pos <- numFactor(f$long,f$lat)  # coordinate
# mat
FormA1 = formula(y ~ intensity + YEAR + mat(pos + 0 | SITES))
FormA2 = formula(y ~ intensity + YEAR + (1|SITES) + mat(pos + 0 | SITES))
# gau
FormB1 = formula(y ~ intensity + YEAR + gau(pos + 0 | SITES))
FormA2 = formula(y ~ intensity + YEAR + (1|SITES) + gau(pos + 0 | SITES))
# exp
FormA1 = formula(y ~ intensity + YEAR + exp(pos + 0 | SITES))
FormB2 = formula(y ~ intensity + YEAR + (1|SITES) + exp(pos + 0 | SITES))





model <- glmmTMB::glmmTMB(FormA1,  #FormA,FormB,FormC I tried one by one.
                            data = f,
                            family = 'nbinom2')


But the results were there’re only exp() variance structure could be running. The variance structures of gau() and mat() would tell the warnings:

Warning in fitTMB(TMBStruc) :
  Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in fitTMB(TMBStruc) :
  Model convergence problem; false convergence (8). See vignette('troubleshooting')

Even though the exp() could be running, when I tried to check spatial correlation, there was also significant violation to the assumption(p-value = 0.0087).

So, I gave up the method that I add coordinates to a variance structure of glmmTMB and turned to glmmPQL. Because , it could give correlation variance structure for MASS::glmmPQL and similar to gls() and lme() in nlme package.

The correlation structure are:

  • corSpher
  • corSpatial
  • corGau
  • corExp
  • corRatio
  • corLin

However, when I tried to run the code as below:

model <- MASS::glmmPQL(y ~ intensity, random = ~ 1|SITES,family = negative.binomial(theta = 0.9),
    correlation = corExp(form = ~  long+lat|SITES),  #corExp,corGau, corSpher, corLin ,corRatio, all I tried
    data = f)  

But they always show error: Error in getCovariate.corSpatial(object, data = data):cannot have zero distances in "corSpatial".

Considering my field investigation conditions, someone must consider to group my data and to obtain sum or mean of y, but the problems are:

  1. group YEAR and SITES to sum of y must yield bias error due to different observations within groups.
  2. It might be suitable when group by to get mean of y, but I don’t want to narrow the number of observations..

I have looked for a lot of posts in stackoverflow and other websites about how to resolve the problem: cannot have zero distances in "corSpatial". But I didn’t get good resolution including someone said: The barrier is repeated coordinates in the data. Sure, he said right, but I had put lon + lat|SITES at correlation structure, it also showed cannot have zero distances in "corSpatial". So, I realized I must look for help from Internet.

So, the question I want to say is : According to my data, how to create a good mixed linear model not to violation to residuals pattern and spatial auto-correlation.

Of course , you could use other models but better not to use Bayesians because it’s really strange for me.

If you really want to help me , I wish you might post your code so that I can run it again rather than type a little sentences in here. I’ve upload part of my original data so that support convenience to you . I really really want to get helps from all of friends on internet.

Thanks in advance at 3 o’clock on mid-night on my time zone.

Here is my data:

f  = 
structure(list(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("2000", 
"2001"), class = "factor"), SITES = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L, 11L, 11L, 14L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 
15L, 15L, 15L, 15L, 15L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L), .Label = c("A1", "A10", "A11", "A12", "A17", 
"A18", "A20", "A3", "A8", "A9", "B1", "B14", "B16", "B3", "B6"
), class = "factor"), long = c(80.41081194, 80.41081194, 80.41081194, 
80.41081194, 80.41081194, 80.41081194, 80.41081194, 80.41081194, 
80.41081194, 80.48462682, 80.48462682, 80.48462682, 80.48462682, 
80.48462682, 80.48462682, 80.48462682, 80.48462682, 80.48462682, 
80.57384333, 80.57384333, 80.57384333, 80.57384333, 80.57384333, 
80.57384333, 80.57384333, 80.57384333, 80.57384333, 80.48125, 
80.48125, 80.48125, 80.48125, 80.48125, 80.48125, 80.48125, 80.48125, 
80.48125, 80.56281944, 80.56281944, 80.56281944, 80.56281944, 
80.56281944, 80.56281944, 80.56281944, 80.56281944, 80.56281944, 
80.46875, 80.46875, 80.46875, 80.46875, 80.46875, 80.46875, 80.46875, 
80.46875, 80.46875, 80.41125278, 80.41125278, 80.41125278, 80.41125278, 
80.41125278, 80.41125278, 80.41125278, 80.41125278, 80.41125278, 
80.30014935, 80.30014935, 80.30014935, 80.30014935, 80.30014935, 
80.30014935, 80.30014935, 80.30014935, 80.30014935, 80.26167842, 
80.26167842, 80.26167842, 80.26167842, 80.26167842, 80.26167842, 
80.26167842, 80.26167842, 80.26167842, 80.20558138, 80.20558138, 
80.20558138, 80.20558138, 80.20558138, 80.20558138, 80.20558138, 
80.20558138, 80.20558138, 80.349776, 80.349776, 80.349776, 80.349776, 
80.349776, 80.349776, 80.349776, 80.349776, 80.349776, 80.349776, 
80.349776, 80.349776, 80.373821, 80.373821, 80.373821, 80.373821, 
80.373821, 80.373821, 80.373821, 80.373821, 80.373821, 80.373821, 
80.373821, 80.373821, 80.434011, 80.434011, 80.434011, 80.434011, 
80.434011, 80.434011, 80.434011, 80.434011, 80.434011, 80.434011, 
80.434011, 80.434011, 80.520748, 80.520748, 80.520748, 80.520748, 
80.520748, 80.520748, 80.520748, 80.520748, 80.520748, 80.520748, 
80.520748, 80.520748, 80.507795, 80.507795, 80.507795, 80.507795, 
80.507795, 80.507795, 80.507795, 80.507795, 80.507795, 80.507795, 
80.507795, 80.507795), lat = c(41.17633167, 41.17633167, 41.17633167, 
41.17633167, 41.17633167, 41.17633167, 41.17633167, 41.17633167, 
41.17633167, 41.12161667, 41.12161667, 41.12161667, 41.12161667, 
41.12161667, 41.12161667, 41.12161667, 41.12161667, 41.12161667, 
40.97858194, 40.97858194, 40.97858194, 40.97858194, 40.97858194, 
40.97858194, 40.97858194, 40.97858194, 40.97858194, 41.03570917, 
41.03570917, 41.03570917, 41.03570917, 41.03570917, 41.03570917, 
41.03570917, 41.03570917, 41.03570917, 40.80911056, 40.80911056, 
40.80911056, 40.80911056, 40.80911056, 40.80911056, 40.80911056, 
40.80911056, 40.80911056, 40.80239444, 40.80239444, 40.80239444, 
40.80239444, 40.80239444, 40.80239444, 40.80239444, 40.80239444, 
40.80239444, 40.82870833, 40.82870833, 40.82870833, 40.82870833, 
40.82870833, 40.82870833, 40.82870833, 40.82870833, 40.82870833, 
41.02665248, 41.02665248, 41.02665248, 41.02665248, 41.02665248, 
41.02665248, 41.02665248, 41.02665248, 41.02665248, 40.99086151, 
40.99086151, 40.99086151, 40.99086151, 40.99086151, 40.99086151, 
40.99086151, 40.99086151, 40.99086151, 41.00486149, 41.00486149, 
41.00486149, 41.00486149, 41.00486149, 41.00486149, 41.00486149, 
41.00486149, 41.00486149, 41.021529, 41.021529, 41.021529, 41.021529, 
41.021529, 41.021529, 41.021529, 41.021529, 41.021529, 41.021529, 
41.021529, 41.021529, 40.970129, 40.970129, 40.970129, 40.970129, 
40.970129, 40.970129, 40.970129, 40.970129, 40.970129, 40.970129, 
40.970129, 40.970129, 40.85327, 40.85327, 40.85327, 40.85327, 
40.85327, 40.85327, 40.85327, 40.85327, 40.85327, 40.85327, 40.85327, 
40.85327, 41.045005, 41.045005, 41.045005, 41.045005, 41.045005, 
41.045005, 41.045005, 41.045005, 41.045005, 41.045005, 41.045005, 
41.045005, 41.083288, 41.083288, 41.083288, 41.083288, 41.083288, 
41.083288, 41.083288, 41.083288, 41.083288, 41.083288, 41.083288, 
41.083288), y = c(36L, 39L, 45L, 167L, 150L, 113L, 74L, 80L, 
72L, 197L, 249L, 212L, 195L, 173L, 105L, 43L, 44L, 27L, 80L, 
69L, 52L, 306L, 308L, 188L, 123L, 106L, 121L, 27L, 32L, 26L, 
197L, 232L, 245L, 114L, 110L, 170L, 12L, 46L, 57L, 254L, 332L, 
207L, 33L, 95L, 89L, 40L, 81L, 77L, 119L, 122L, 194L, 81L, 54L, 
71L, 34L, 29L, 33L, 136L, 129L, 189L, 42L, 38L, 69L, 53L, 39L, 
55L, 62L, 33L, 49L, 26L, 15L, 31L, 130L, 144L, 79L, 76L, 41L, 
86L, 21L, 25L, 16L, 104L, 115L, 109L, 69L, 98L, 84L, 26L, 19L, 
30L, 10L, 12L, 4L, 13L, 9L, 4L, 27L, 0L, 4L, 6L, 2L, 6L, 3L, 
1L, 0L, 1L, 1L, 1L, 6L, 3L, 5L, 0L, 5L, 2L, 0L, 1L, 2L, 2L, 0L, 
2L, 1L, 4L, 1L, 10L, 6L, 9L, 2L, 2L, 2L, 0L, 5L, 0L, 17L, 7L, 
20L, 33L, 34L, 45L, 1L, 2L, 0L, 2L, 2L, 2L, 5L, 6L, 3L, 12L, 
22L, 12L), intensity = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("low", 
"high"), class = "factor")), class = "data.frame", row.names = c(NA, 
-150L))
PeterPanPan
  • 151
  • 4
  • 1
    There's a lot here. Keep in mind the [mcve] guidance and see if you can narrow down the question, possibly by splitting into multiple posts. Additionally, questions that are more about statistics than code should be on [stats.se] instead; it seems like you have a mix of stats questions and code questions all in one post – camille Jan 12 '22 at 19:49
  • I will come back and post a response. The short answer is that if you look at the plot of residuals across space, there's a clear SW to NE gradient. Adding `lat` and `long` to the model solves the problem. (I assume that you don't have a problem doing that.) – Ben Bolker Jan 12 '22 at 23:37

1 Answers1

1

tl;dr the spatial pattern can be well explained by a spatial trend (x-y plane), you don't need an autocorrelation model. (If you did you might also try fitting this model in mgcv.)

Packages:

library(glmmTMB)
library(DHARMa)
library(ggplot2); theme_set(theme_bw())
library(tidyverse)
library(colorspace)

I plotted the data (not reproduced here, it didn't show anything amazing)

ggplot(f, aes(intensity, y, shape=YEAR, fill = SITES)) +
  geom_boxplot() + facet_wrap(~YEAR)

There are 15 sites, thus only 15 unique lat/long positions. Let's plot the average values of the residuals by site, in space (there are other ways to do this, but augment is convenient)

a <- broom.mixed::augment(model, data = f)
aa <- (a
  %>% group_by(long, lat)
  %>% summarize(across(.resid, mean), .groups = "drop")
)
ggplot(aa, aes(long, lat, colour = .resid),
              size = abs(.resid))) +
  geom_point() +
  scale_color_continuous_diverging() +
  scale_size(range = c(3, 8))

plot of residuals in blue/red, showing strong SW/NE structure

Plotting with sign(.resid) instead makes the pattern even clearer:

plot of residuals again

almost all the negative residuals are in the SW, almost all the positive residuals are in the NE.

Solution: add lat and long to the model as predictors.

model2 <- update(model, . ~. + lat + long)

res2 <- simulateResiduals(model2)
res3 <- recalculateResiduals(res2, group = f$SITES)
locs <- f %>% group_by(SITES) %>% summarise(across(c(lat, long), mean))
testSpatialAutocorrelation(res3, x = locs$long, y = locs$lat)

This takes care of the significant autocorrelation, and makes the overall diagnostics look better.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks ben. You are a Statistician who I mostly respect. I make a decision to accept the method you advice. while I still wonder if it's reasonable to add coordinates to variance structure like `correlation = corSpher(form = ~long+lat|SITES)` for my data. Could I think my hierarchical data structure won't support correlation structure like above. If so, I will never to consider this type of structure because my investigation design always yield this kind of data. If it supports the structure, how I should code it. I sincerely hope you can give me a answer. Thank you in advance. – PeterPanPan Jan 13 '22 at 07:20