
I am trying to determine the influence of five categorical independent variables and one continuous independent variable on some ecological count data using a GLM in R. Here is an example of what the data I am using look like:

No.  DateNum   Tunnel   Trt   Row   Time   AvgTemp   sqTotal.L 
  1    44382        1     A     3     AM      30.0        1.41
  2    44384        3     C     2     PM      21.0        2.23
  3    44384        7     D     3     AM      24.0        3.65
  4    44400        4     B     1     AM      27.5        2.78

The fixed effects DateNum, Tunnel and Row are coded as ordered factors, Trt and Time are unordered factors, and AvgTemp is coded as numeric. 'sqTotal.L' is the square-root-transformed count data, which is a normally distributed response variable. I decided to use a GLM instead of ANOVA because the experimental design is not balanced and there are different numbers of samples from the different experimental plots.
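For reference, the coding described above can be set up as follows. This is only a sketch: `rowfvs2` and the column names come from the sample data, and `Total.L` (the untransformed count) is a hypothetical name for the raw column.

```r
# Assumed setup mirroring the coding described above (column names
# from the sample data; Total.L is a hypothetical raw count column):
rowfvs2$DateNum <- factor(rowfvs2$DateNum, ordered = TRUE)
rowfvs2$Tunnel  <- factor(rowfvs2$Tunnel,  ordered = TRUE)
rowfvs2$Row     <- factor(rowfvs2$Row,     ordered = TRUE)
rowfvs2$Trt     <- factor(rowfvs2$Trt)   # unordered
rowfvs2$Time    <- factor(rowfvs2$Time)  # unordered
rowfvs2$sqTotal.L <- sqrt(rowfvs2$Total.L)  # square-root transform
```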

When I run the following code for a GLM and then the drop1() function on the resulting model, the effect of my continuous fixed effect (AvgTemp) appears not to be incorporated into the result of the drop1() function:

> rowfvs.glm <- glm(sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt, 
+    family = gaussian, data = rowfvs2)
> summary(rowfvs.glm)

Call:
glm(formula = sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + 
    Trt, family = gaussian, data = rowfvs2)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.63548  -0.38868   0.06587   0.41777   1.31886  

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  8.89037    5.98768   1.485   0.1492  
AvgTemp     -0.28191    0.24566  -1.148   0.2612  
Row.L       -0.46085    0.24735  -1.863   0.0734 .
Row.Q        0.08047    0.25153   0.320   0.7515  
DateNum.L   -1.17448    0.85015  -1.382   0.1785  
DateNum.Q    0.57857    0.64731   0.894   0.3793  
DateNum.C   -2.17331    2.15684  -1.008   0.3226  
DateNum^4   -0.76025    1.09723  -0.693   0.4943  
DateNum^5   -1.62269    0.68388  -2.373   0.0250 *
DateNum^6    0.63799    0.70822   0.901   0.3756  
DateNum^7         NA         NA      NA       NA  
TimePM      -0.31436    0.87881  -0.358   0.7233  
Tunnel.L     1.38420    0.62199   2.225   0.0346 *
Tunnel.Q    -0.03521    0.56561  -0.062   0.9508  
Tunnel.C     0.81639    0.54880   1.488   0.1484  
Tunnel^4     0.24029    0.61180   0.393   0.6976  
Tunnel^5     0.30665    0.51849   0.591   0.5592  
Tunnel^6     0.67603    0.53728   1.258   0.2191  
TrtB         0.10067    0.40771   0.247   0.8068  
TrtC         0.31278    0.41048   0.762   0.4527  
TrtD        -0.49857    0.46461  -1.073   0.2927  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.7583716)

    Null deviance: 50.340  on 46  degrees of freedom
Residual deviance: 20.476  on 27  degrees of freedom
AIC: 136.33

Number of Fisher Scoring iterations: 2

> drop1(rowfvs.glm, test = "Chi")
Single term deletions

Model:
sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt
        Df Deviance    AIC scaled dev. Pr(>Chi)  
<none>       20.476 136.33                       
AvgTemp  0   20.476 136.33      0.0000           
Row      2   23.128 138.05      5.7249  0.05713 .
DateNum  6   25.517 134.67     10.3447  0.11087  
Time     1   20.573 134.55      0.2222  0.63736  
Tunnel   6   27.525 138.23     13.9039  0.03073 *
Trt      3   23.201 136.20      5.8725  0.11798  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

By contrast, when I try running the anova() function on the model, I do get an analysis of the influence of AvgTemp on sqTotal.L:

> anova(rowfvs.glm, test = "Chi")
Analysis of Deviance Table

Model: gaussian, link: identity

Response: sqTotal.L

Terms added sequentially (first to last)


        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                       46     50.340              
AvgTemp  1   0.7885        45     49.551 0.3078945    
Row      2   1.0141        43     48.537 0.5124277    
DateNum  6  17.6585        37     30.879 0.0007065 ***
Time     1   0.3552        36     30.523 0.4937536    
Tunnel   6   7.3222        30     23.201 0.1399428    
Trt      3   2.7251        27     20.476 0.3088504    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So, my questions are: why isn't the drop1() function taking AvgTemp into account? And is it sufficient to use the results from the anova() function in my report, or do I need to figure out how to get the drop1() function to incorporate this continuous predictor?

  • I'm not sure whether this is an acceptable solution or not, but after doing a bit more research, I've tried recoding AvgTemp as a covariate by including it as an offset variable in cases where it was found to significantly influence the response variable. That at least allows me to run the emmeans() function to get least square means for the levels of significant fixed effects. – D.Hodgkiss Oct 26 '21 at 09:11

1 Answer


This is a bit of a guess because we don't have your data, but: I believe the answer is related to the multicollinearity in your design matrix (as signalled by the message "1 not defined because of singularities" and the presence of an NA estimate for the DateNum^7 parameter).

When you have collinear (perfectly correlated) columns in your design matrix, how they get dealt with can be a bit unpredictable. lm() picks one of the columns to discard: in this case it's DateNum^7. However, assuming that AvgTemp is also in the set of collinear columns, if you drop AvgTemp from the model then lm() will no longer need to drop DateNum^7 when the model is refitted, yet you will still get the same goodness of fit (AIC, log-likelihood, etc.), because the variable you dropped was redundant. That is why drop1() reports 0 Df and an unchanged deviance for AvgTemp.
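A minimal simulated sketch of this behaviour (toy data, not the poster's):

```r
set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1            # x2 is an exact linear combination of x1
y  <- x1 + rnorm(20)

fit <- glm(y ~ x1 + x2, family = gaussian)
coef(fit)               # x2 is NA: "not defined because of singularities"
drop1(fit, test = "Chisq")
# Dropping x1 frees x2 to be estimated, so the refitted model has the
# same number of parameters and an identical deviance: drop1() reports
# Df = 0 for x1, just as it does for AvgTemp in the question.
```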

You should be able to explore this possibility via caret::findLinearCombos(model.matrix(rowfvs.glm)), although careful thought about your observational/experimental design might also enlighten you as to why these variables are collinear ...

Ben Bolker