I am learning how to use glms to test hypothesis and to see how variables relate among themselves. I am trying to see if the variable tick prevalence (Parasitized individuals/Assessed individuals)(dependent variable) is influenced by the number of captured hosts (independent variable).
My data looks like figure 1.(116 observations).
I have read that one way to know which distribution to use is to see which distribution the dependent variable has. So I built a histogram for the TickPrev variable (figure 2).
I got to the conclusion that the binomial negative distribution would be the best option. Before I ran the analysis, I transformed the TickPrevalence variable (it was a proportion, and the glm.nb only works with integers) applying the following codes:
df <- df %>% mutate(TickPrev=TickPrev*100)
df$TickPrev <- as.integer(df$TickPrev)
Then I applied the glm.nb function from the MASS package, and obtained this summary
summary(glm.nb(df$TickPrev~df$Captures, link=log))
Call:
glm.nb(formula = df15$TickPrev ~ df15$Captures, link = log, init.theta = 1.359186218)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.92226 -0.69841 -0.08826 0.44562 1.70405
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.438249 0.125464 27.404 <2e-16 ***
df15$Captures -0.008528 0.004972 -1.715 0.0863 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.3592) family taken to be 1)
Null deviance: 144.76 on 115 degrees of freedom
Residual deviance: 141.90 on 114 degrees of freedom
AIC: 997.58
Number of Fisher Scoring iterations: 1
Theta: 1.359
Std. Err.: 0.197
2 x log-likelihood: -991.584
I know that the p-value indicates that there isn't enough proves to believe that the two variables are related. However, I am not sure if I used the best model to fit the data and how I can know that. Can you please help me? Also, knowing what I show, is there a better way to see if this variables are related? Thank you very much.