
I have measurements obtained from two groups (a and b), where each group has the same three levels (x, y, z). The measurements are counts out of fixed totals (i.e., rates); in group a there can be no zero counts, whereas in group b there can be (hard-coded in the example below). Here's my example data.frame:

set.seed(3)
# group a: Poisson counts for levels x, y, z; group b: Poisson counts for x and y,
# and an all-zero level z (15 observations per group:level cell)
df <- data.frame(count = c(rpois(15, 5), rpois(15, 5), rpois(15, 3),
                           rpois(15, 7.5), rpois(15, 2.5), rep(0, 15)),
                 group = as.factor(c(rep("a", 45), rep("b", 45))),
                 level = as.factor(rep(c(rep("x", 15), rep("y", 15), rep("z", 15)), 2)))

# add the total (denominator) - fixed to the same value for all rows
df$total <- rep(max(df$count)*2, nrow(df))
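A quick tabulation of the mean count per group and level (just a sanity check of the simulated data, not part of the analysis itself) makes the all-zero b:z cell visible:

# mean count per group:level cell; the b:z cell is all zeros by construction
aggregate(count ~ group + level, data = df, FUN = mean)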

I'm interested in quantifying, for each level x, y, z, whether there is any difference between the (average) measurements of a and b, and if there is, whether it is statistically significant.

From what I understand, a Poisson GLM for rates seems appropriate for these types of data. In my case a negative binomial GLM may be even more appropriate, since my data are overdispersed (I tried to build some overdispersion into my example data; in my real data it is definitely the case).
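One rough way to check the overdispersion (just a sketch; I use the additive model here to sidestep the all-zero cell) is to fit a plain Poisson rate model and compare the Pearson statistic to the residual degrees of freedom, where values well above 1 point to overdispersion:

# rough overdispersion check with a plain Poisson rate model
pois_fit <- glm(count ~ group + level + offset(log(total)),
                family = poisson, data = df)
# Pearson chi-square / residual df; values well above 1 suggest overdispersion
sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)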

Following the answer I got to a previous post, I went with:

library(dplyr)
library(MASS)

# collapse group a to a single baseline level and keep a separate
# group:level term for each level of group b
df2 <- df %>%
  mutate(interactions = paste0(group, ":", level),
         interactions = ifelse(group == "a", "a", interactions),
         interactions = as.factor(interactions))

fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)

> summary(fit)

Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2, 
    init.theta = 41.48656798, link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.40686  -0.75495  -0.00009   0.46892   2.28720  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.02047    0.07824 -25.822  < 2e-16 ***
interactionsb:x    0.59336    0.13034   4.552  5.3e-06 ***
interactionsb:y   -0.28211    0.17306  -1.630    0.103    
interactionsb:z  -20.68331 2433.94201  -0.008    0.993    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)

    Null deviance: 218.340  on 89  degrees of freedom
Residual deviance:  74.379  on 86  degrees of freedom
AIC: 330.23

Number of Fisher Scoring iterations: 1


              Theta:  41.5 
          Std. Err.:  64.6 

 2 x log-likelihood:  -320.233 

I'd expect the difference between a and b for level z to be significant. However, the Std. Error for the level-z term is enormous, and hence the p-value is nearly 1. My question is whether the model I'm using is set up correctly to answer my question, mainly regarding the use of the interactions factor.
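For reference, one way to put the fitted coefficients on the rate scale (a minimal sketch using Wald intervals via confint.default(); the intercept corresponds to the group-a baseline in this coding):

# rate ratios relative to the group-a baseline, with Wald confidence intervals
exp(cbind(estimate = coef(fit), confint.default(fit)))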

  • My guess would be that the algorithm is not converging correctly. This can occur when the values are all '0' for a particular factor, which causes numerical instability (e.g. the model can't discriminate between different very small values for the mean). Lastly and probably most critically, the initial variance estimate will be 0 for this factor. This creates a singularity which will make your standard error 'explode'. See what happens if you change the very last count in your data frame (with interaction b:z) to 1. I think this will then show this variable is significant. – Pash101 Nov 05 '15 at 15:23
  • Usually GLMs throw a warning when they do not converge, but that doesn't happen in my case. Regardless, adding a 'pseudocount' like you suggest does indeed stabilize the behavior and the error no longer explodes (see the sketch after these comments). I guess this means that the implementation of glm (allowing for rates via the offset() option) does not handle such zero-inflation cases well? Any idea of alternative models that do? Or would adding such "pseudocounts" be the best solution? – user1701545 Nov 05 '15 at 16:27
  • I had seen this issue before when fitting logistic regression models where one of the variables created perfect separation in the response variable. I'm not sure of any alternative models for this regression, nor am I a real expert on this. My suggestions would be - 1) do nothing, the fact that this variable is shown as not significant is simply a fitting error (it likely is significant even with a pseudocount, so you can reason it must be without); 2) collect more data: if this doesn't fix the error, it will further reinforce your argument that this variable is clearly significant (as per 1) – Pash101 Nov 05 '15 at 17:31
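For completeness, here is a sketch of the pseudocount check suggested in the comments (replace the last b:z count with 1 and refit; illustrative only, not a principled fix for the all-zero cell):

# add a single pseudocount to the all-zero b:z cell and refit
df3 <- df2
df3$count[nrow(df3)] <- 1
fit2 <- glm.nb(count ~ interactions + offset(log(total)), data = df3)
summary(fit2)$coefficients  # the b:z standard error should no longer blow up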

0 Answers