1

I just learned that MASS::fitdistr, when fitting a negative binomial, is sensitive to the number of zeroes... a bummer since I was hoping to fit this distribution to count data of species where the number of zeroes is unknown and I'd argue unknowable. My goal here is to fit a negative binomial given only the positive (nonzero) part of the distribution... and have confidence that on simulated data, it would return (approximately) the simulated parameter values. I'm not dedicated to using MASS::fitdistr. Thanks for any suggestions.

# function to fit neg binomial to abundances of species at the per-site level

nbpar <- function(ab){
  MASS::fitdistr(ab, densfun = "Negative Binomial"
               , lower=c(1e-9, 1e-9))}

# simulate an abundance vector
set.seed(100)
site_abundance<-rnbinom(667, size = 0.4, mu = 30)

# fit the distribution and get simulation parameters back out
nbpar(site_abundance) # returns something very close to simulated parameters

# fit again with zeros omitted
nbpar(site_abundance[site_abundance>0]) # Oh Snap, gives nonsense... at least in the sense that the estimated parameters are pretty far off from the inputs!

UPDATE: For me, a fitting algorithm would work if it returned (in a fairly accurate, low-bias way) parameters that fit the complete data (including 0s), based only on the truncated (strictly positive) data. An expanded example of why MASS::fitdistr isn't what I want:

# first few lines are example as above
##########
##########
library(ggplot2)
# function to fit neg binomial to abundances of species at the per-site level

nbpar <- function(ab){
  MASS::fitdistr(ab, densfun = "Negative Binomial"
               , lower=c(1e-9, 1e-9))}

trunc<-function(x){x[x>0]}

# simulate an abundance vector
set.seed(100)

# slightly more abstract than first example

trials<-667
size = 0.4
mu = 30

site_abundance <-rnbinom(n = trials, size = size, mu = mu)



# fit the distribution and get simulation parameters back out
nbpar(site_abundance) # returns something very close to simulated parameters

# fit again with zeros omitted
x<-nbpar(site_abundance[site_abundance>0]) # different parameters

##############
##############
# new stuff
# I suspected the parameters drift upwards 
# if I do this iteratively, and a quick test 
# showed I was right

drift <- data.frame()
for(driftSteps in c(1:40)){
  mypar <- nbpar(trunc(site_abundance))
  size <- mypar$estimate[[1]]
  mu <- mypar$estimate[[2]]
  site_abundance <- rnbinom(n = trials, size = size, mu = mu) 
  drift[driftSteps,"driftSteps"]<- driftSteps
  drift[driftSteps,"size"]<- size
  drift[driftSteps,"mu"]<- mu
  
}


drift %>% ggplot(aes(driftSteps, mu)) +
  geom_point() +
  theme_classic()
Michael Roswell
  • 1,300
  • 12
  • 31
  • If I change it to `site_abundance<-rnbinom(667, size = 0.01, mu = 30)` I end up fitting a distribution without 0's that is >10x the simulated mean parameter. Feeding the simulated parameters vs. the estimated back in to rnbinom gives very different distributions. – Michael Roswell Jun 17 '22 at 18:31
  • ``` hist(rnbinom(667, size = 0.01, mu =30, breaks = 20)); hist(rnbinom(667, size = 0.29, mu = 365, breaks = 20))``` – Michael Roswell Jun 17 '22 at 18:31
  • Thanks, I like the comment/answer below, but I do feel like the "fitting quite well" is not to be counted on here. If you do this a few times (simulate, truncate zeroes, fit, simulate (based on fitted param values), truncate, fit etc.) the parameters drift (is it the case they will always drift upwards?) and I'd like something that looks a bit more consistent if it is possible to do so. – Michael Roswell Jun 17 '22 at 18:50
  • Are you looking for something that could be inspired by [this R-bloggers post](https://www.r-bloggers.com/2020/08/generating-data-from-a-truncated-distribution/)? The question seems more and more a question to [Cross Validated](https://stats.stackexchange.com/). – Rui Barradas Jun 17 '22 at 18:59
  • 1
    @RuiBarradas, I glanced at the R-blogger post and that seems in line with where this is going! I agree this might be more of statsy-theoretical question than I first hoped and might think more about how to frame the problem for CV, or myself for an ambitious coding session that likely starts with the link you shared. Thanks so much for thinking this through together! – Michael Roswell Jun 17 '22 at 19:08

1 Answers1

0

This is posted as an answer because it's too long to fit in a comment.
I do not see a problem with the parameters estimates, they fit the data rather well as can be seen from the plot below.
Also note that the zeros represent 19% of the data, without them the parameters estimates must be different than those used in the data generation process.

# function to fit neg binomial to abundances of 
# species at the per-site level
nbpar <- function(ab){
  MASS::fitdistr(ab, densfun = "Negative Binomial", lower=c(1e-9, 1e-9))
}

# simulate an abundance vector
set.seed(100)
site_abundance <- rnbinom(667, size = 0.4, mu = 30)

# fit with zeros omitted
pars <- nbpar(site_abundance[site_abundance > 0])

mean(site_abundance == 0)
#> [1] 0.1904048

empiric_dens <- proportions(table(site_abundance[site_abundance > 0]))
barplot(empiric_dens)
curve(dnbinom(x, size = pars$estimate[1], mu = pars$estimate[2]), 
      from = 0, to = 300, col = "red", lwd = 2, add = TRUE)

Created on 2022-06-17 by the reprex package (v2.0.1)

Michael Roswell
  • 1,300
  • 12
  • 31
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66