Probably a very simple question, but I am banging my head against the wall with this. To start with, I am not very familiar with R or stats beyond the basics. I work for an NGO monitoring an endangered species, so we have very little in the way of resources. I am trying to determine the trend of a population while accounting for very patchy data.
I have about 20 years of data. Each year, volunteers go to roost sites to count the birds leaving the roosts in the morning. There is a lot of variation from year to year in which sites are counted, and how many. I have also gathered other metrics that I believe may affect the numbers counted at any one time, such as moon phase (days until/since the nearest full moon) and cumulative precipitation over the previous 1, 3, 6, 12 and 24 months. Along with Year and Effort, these make up my independent variables.
My understanding is that I should use a GLM to see to what degree each variable affects the dependent variable (total counted), the idea being that I can check whether the population is really increasing, rather than the general increase in total counted simply being down to increased effort over the years.
I have played around with R and spent many hours googling, and it seems that a GLM was the right way to go, but I struggled to work out which model best described the relationship. I was then introduced to the negative binomial GLM (rather than a quasi-Poisson), which produces an AIC that would tell me about the relative fit of the models.
I was then introduced to the MuMIn dredge function, which basically shows me that the top-ranked models with a delta of less than 2 are the best descriptors.
My problem now is that I am so far down a rabbit hole that I don't understand at all, and I no longer know whether I'm even looking at the right information. So, to start with the basics: is a negative binomial the right GLM to use in my case?
> dput(head(totals))
structure(list(Year = c(2002L, 2003L, 2005L, 2006L, 2007L, 2008L
), Total = c(433L, 627L, 141L, 714L, 609L, 429L), Effort = c(10L,
13L, 14L, 25L, 27L, 21L), Rain.24 = c(957.45, 867.23, 1408.05,
1634.91, 1127.47, 859.42), Rain.12 = c(426.52, 440.71, 878.8,
756.11, 371.36, 488.06), Rain.6 = c(321.72, 272.84, 639.16, 542,
250.71, 395.59), Rain.3 = c(157.94, 65.25, 437.35, 351.1, 86.94,
129.66), Rain.1 = c(27.94, 8.74, 99.3, 70.79, 25.8, 21.05), Nearest.full.moon = c(2L,
7L, 4L, 14L, 6L, 4L)), row.names = c(NA, 6L), class = "data.frame")
My dependent variable is "Total" (the full 20-year vector, tot, is below) with the following distribution:
> tot
[1] 290 433 870 277 714 669 429 479 860 547 654 865 845 1085 883 583 1023 1097 1182
[20] 945
> skewness(tot)
[1] -0.1372469
> var(tot)
[1] 73319.84
> mean(tot)
[1] 736.5
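For what it's worth, the variance above is roughly 100 times the mean, which I gather is the overdispersion that rules out a plain Poisson. A minimal sketch of that check in base R, just reusing the tot vector above:

```r
# counts pasted from the output above
tot <- c(290, 433, 870, 277, 714, 669, 429, 479, 860, 547,
         654, 865, 845, 1085, 883, 583, 1023, 1097, 1182, 945)

# a Poisson model assumes variance roughly equal to the mean (ratio near 1);
# here the ratio is close to 100, which points to strong overdispersion
var(tot) / mean(tot)
```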
Here is the code that I am running for the main chunk of the analysis:
library(MASS)   # glm.nb
library(MuMIn)  # dredge, model.avg

# dredge() refuses a global model fitted with na.omit, so na.fail
# must be set *before* fitting
options(na.action = na.fail)

model3 <- glm.nb(Total ~ ., data = totals)
summary(model3)

res <- dredge(model3, trace = 2)
subset(res, delta <= 2, recalc.weights = FALSE)

summary(model.avg(res, revised.var = FALSE))
importance(res)  # called sw() in recent MuMIn versions

options(na.action = na.omit)  # restore the default
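In case it clarifies the question: my understanding is that, unlike quasi-Poisson, the negative binomial fit has a real likelihood, so it can be compared to a plain Poisson by AIC. A sketch of that comparison using just the six rows from dput() above and a single predictor (I am not sure this is the right way to test it):

```r
library(MASS)  # glm.nb

# the first six rows of the data, from dput() above
totals <- data.frame(Year   = c(2002L, 2003L, 2005L, 2006L, 2007L, 2008L),
                     Total  = c(433L, 627L, 141L, 714L, 609L, 429L),
                     Effort = c(10L, 13L, 14L, 25L, 27L, 21L))

pois <- glm(Total ~ Effort, family = poisson, data = totals)
nb   <- glm.nb(Total ~ Effort, data = totals)

# quasi-Poisson reports no AIC at all; Poisson vs negative binomial
# can be compared directly (lower AIC = better, penalising parameters)
AIC(pois, nb)
```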
Sorry if this makes no sense; I will try to edit the question according to any feedback.