I have measurements obtained from 2 groups (a and b) where each group has the same 3 levels (x, y, z). The measurements are counts out of totals (i.e., rates), but in group a there cannot be zeros whereas in group b there can (hard coded in the example below).
Here's my example data.frame
:
set.seed(3)
df <- data.frame(count = c(rpois(15,5),rpois(15,5),rpois(15,3),
rpois(15,7.5),rpois(15,2.5),rep(0,15)),
group = as.factor(c(rep("a",45),rep("b",45))),
level = as.factor(rep(c(rep("x",15),rep("y",15),rep("z",15)),2)))
#add total - fixed for all
df$total <- rep(max(df$count)*2,nrow(df))
I'm interested in quantifying for each level
x,y,z if there is any difference between the (average) measurements of a and b? If there is, is it statistically significant?
From what I understand a Poisson GLM for rates seems to be appropriate for these types of data. In my case it seems that perhaps a negative binomial GLM would be even more appropriate since my data are over dispersed (I tried to create that in my example data to some extent but in my real data it is definitely the case).
Following the answer I got for a previous post I went with:
library(dplyr)
library(MASS)
df %>%
mutate(interactions = paste0(group,":",level),
interactions = ifelse(group=="a","a",interactions)) -> df2
df2$interactions = as.factor(df2$interactions)
fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)
> summary(fit)
Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2,
init.theta = 41.48656798, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.40686 -0.75495 -0.00009 0.46892 2.28720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02047 0.07824 -25.822 < 2e-16 ***
interactionsb:x 0.59336 0.13034 4.552 5.3e-06 ***
interactionsb:y -0.28211 0.17306 -1.630 0.103
interactionsb:z -20.68331 2433.94201 -0.008 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)
Null deviance: 218.340 on 89 degrees of freedom
Residual deviance: 74.379 on 86 degrees of freedom
AIC: 330.23
Number of Fisher Scoring iterations: 1
Theta: 41.5
Std. Err.: 64.6
2 x log-likelihood: -320.233
I'd expect the difference between a and b for level z to be significant. However, the Std. Error
for level z seems enormous and hence the p-value is nearly 1.
My question is whether the model I'm using is set up correctly to answer my question (mainly through the use of the interactions
factor?)