I want to thank you in advance for your consideration of my problem.

I have what I naively thought was a fairly straightforward problem: outlier detection across many different sets of count data. Specifically, I want to determine whether one or more values in a series of counts are higher or lower than expected relative to the rest of the counts in that distribution.

The complicating factor is that I need to do this for 3,500 distributions. Some of them will likely fit a zero-inflated, overdispersed Poisson, others may best fit a negative binomial or zero-inflated negative binomial (ZINB), and still others may be approximately normally distributed. For this reason, simple Z-scores or visual inspection of each distribution are not appropriate for much of the dataset. Here is an example of the count data in which I want to detect outliers.

counts1=[1 1 1 0 2 1 1 0 0 1 1 1 1 1 0 0 0 0 1 2 1 1 2 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 1 2 1 1 1 1 1 1 0 1 1 2 0 0 0 1 0 1 2 1 1 0 2 1 1 1 0 0 1 0 0 0 2 0 1 1 0 2 1 0 1 1 0 0 2 1 0 1 1 1 1 2 0 3]

counts2=[0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0]

counts3=[14 13 14 14 14 14 13 14 14 14 14 14 15 14 14 14 14 14 14 15 14 13 14 14 15 12 13 17 13 14 14 14 14 15 14 14 13 14 13 14 14 14 14 13 14 14 14 15 15 14 14 14 14 14 15 14 14 14 14 15 14 14 14 14 14 14 14 14 14 14 14 14 13 16]

counts4=[0 3 1.......]

and so on up to counts3500.
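A quick screen of each series can hint at which family it resembles before any model is fitted. The following is a minimal sketch in Python using only the standard library; the function name `describe_counts` and the toy series are illustrative assumptions, not part of the original data. It reports the variance-to-mean ratio (about 1 for Poisson data, above 1 when overdispersed) and compares the observed share of zeros with the share a plain Poisson of the same mean would produce.

```python
import math
import statistics


def describe_counts(data):
    """Summarise one count series: mean, dispersion index, and zero excess."""
    mean = statistics.fmean(data)
    var = statistics.variance(data)  # sample variance (n - 1 denominator)
    zero_frac = sum(1 for k in data if k == 0) / len(data)
    return {
        "mean": mean,
        # variance / mean is about 1 for Poisson data, > 1 when overdispersed
        "dispersion": var / mean if mean > 0 else float("nan"),
        "zero_frac": zero_frac,
        # share of zeros a plain Poisson with this mean would produce
        "poisson_zero_frac": math.exp(-mean),
    }


# Hypothetical toy series (not taken from the data above)
toy = [0, 0, 0, 0, 5, 5]
summary = describe_counts(toy)
print(summary)
```

A dispersion index well above 1 together with many more zeros than the Poisson share points toward zero-inflated or negative binomial candidates; neither number is a formal test.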

Initially I thought I would need to write a loop in Python or R that fits a set of candidate models to each distribution and selects the best-fitting one according to AIC or another criterion (perhaps using the fitdistrplus package in R?). I could then ask which values are extreme for the selected distribution, i.e. which counts fall in the tails (for example, would a count of "4" be an outlier in the counts1 distribution above?). However, I am not sure this is a valid strategy, and it occurred to me that there may be a simpler method for detecting outliers in count data that I am not aware of. I have searched extensively and found nothing that seems appropriate for my problem, given the number of distributions I want to look at.
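The fit-then-check-the-tails idea can be sketched as follows. This is a minimal, standard-library-only Python illustration under stated assumptions: only two candidates are compared (plain Poisson and zero-inflated Poisson, the latter fitted with a simple EM loop), and every function name here is hypothetical. Negative binomial and ZINB candidates would be added to the comparison in the same way but need a numerical optimiser, so they are left out.

```python
import math


def poisson_logpmf(k, lam):
    """log P(K = k) for a Poisson distribution with mean lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def model_pmf(model, k):
    """P(K = k) under a zero-inflated Poisson; pi = 0 is a plain Poisson."""
    p = (1.0 - model["pi"]) * math.exp(poisson_logpmf(k, model["lam"]))
    return p + model["pi"] if k == 0 else p


def with_aic(model, data, n_params):
    """Attach AIC = 2 * n_params - 2 * log-likelihood to a fitted model."""
    loglik = sum(math.log(model_pmf(model, k)) for k in data)
    model["aic"] = 2 * n_params - 2 * loglik
    return model


def fit_poisson(data):
    """Poisson maximum likelihood: the rate is the sample mean."""
    lam = sum(data) / len(data)
    return with_aic({"name": "poisson", "pi": 0.0, "lam": lam}, data, 1)


def fit_zip(data, iters=500):
    """Zero-inflated Poisson fitted by EM (fixed number of iterations)."""
    n, total = len(data), sum(data)
    n_zero = sum(1 for k in data if k == 0)
    pi, lam = 0.5 * n_zero / n, total / n
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson rate
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return with_aic({"name": "zip", "pi": pi, "lam": lam}, data, 2)


def select_best(data):
    """Return the candidate model with the lowest AIC."""
    if sum(data) <= 0:
        raise ValueError("need at least one non-zero count")
    return min((fit_poisson(data), fit_zip(data)), key=lambda m: m["aic"])


def upper_tail(model, k):
    """P(K >= k) under the fitted model."""
    return max(0.0, 1.0 - sum(model_pmf(model, j) for j in range(k)))


# counts1 copied from the question
counts1 = [int(v) for v in (
    "1 1 1 0 2 1 1 0 0 1 1 1 1 1 0 0 0 0 1 2 1 1 2 1 1 1 1 0 0 1 0 1 1 1 1 "
    "0 0 0 0 0 1 2 1 1 1 1 1 1 0 1 1 2 0 0 0 1 0 1 2 1 1 0 2 1 1 1 0 0 1 0 "
    "0 0 2 0 1 1 0 2 1 0 1 1 0 0 2 1 0 1 1 1 1 2 0 3"
).split()]

best = select_best(counts1)
print(best["name"], "P(K >= 3) =", upper_tail(best, 3))
print(best["name"], "P(K >= 4) =", upper_tail(best, 4))
```

Looping `select_best` over all 3,500 series gives one fitted model per series, and a small upper-tail probability flags a candidate high outlier. Because the suspect value also takes part in the fit, refitting without it before computing the tail probability may be the safer variant; a lower-tail check would sum the probabilities from 0 up to the observed count instead.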

My ultimate goal is to detect significant increases or decreases in a count for each distribution of counts, using the most statistically appropriate methodology.

Once again, thank you for your time.

EDi
  • this might be a http://stats.stackexchange.com question ... – Ben Bolker Apr 17 '13 at 16:57
  • I agree, but there is a surprising lack of information on stats.stackexchange about count data, and I often find that stats (or other) problems requiring programming solutions (in a language I use, such as R or Python) are better addressed by savvy programmers who work in those languages. I could be wrong about this, but neither place seemed to have a good solution posted, so I thought I would start in a forum that might provide both a stats solution and a programming solution in one fell swoop. Thanks – Joe Gomphus Apr 17 '13 at 19:57

1 Answer

The outliers package has good facilities for this type of testing.

library(outliers)

x <- c(rep(c(0,1),1000),3)
chisq.out.test(x)

    chi-squared test for outlier

data:  x
X-squared = 24.6668, p-value = 6.815e-07
alternative hypothesis: highest value 3 is an outlier

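> # note: rep() evaluates chisq.out.test(x) once and repeats the result,
> # so this times a single test; replicate() would run it 3,500 times
> system.time(rep(chisq.out.test(x),3500))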
   user  system elapsed 
  0.004   0.000   0.002 
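For readers working in Python, the figures above can be checked by hand. This sketch assumes the test statistic is the squared deviation of the most extreme value from the mean divided by the sample variance, compared against a chi-squared distribution with 1 degree of freedom; that is an assumption about what the package computes, and the function name `chisq_outlier` is hypothetical.

```python
import math
import statistics


def chisq_outlier(data):
    """Chi-squared statistic and p-value for the value farthest from the mean."""
    mean = statistics.fmean(data)
    var = statistics.variance(data)  # sample variance (n - 1 denominator)
    extreme = max(data, key=lambda v: abs(v - mean))
    stat = (extreme - mean) ** 2 / var
    # Upper tail of a chi-squared with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return extreme, stat, p_value


x = [0, 1] * 1000 + [3]  # same data as the R example above
extreme, stat, p_value = chisq_outlier(x)
print(extreme, round(stat, 4), p_value)
```

Under that assumption the statistic and p-value should closely match the R output, which also makes clear that the test depends only on the mean and variance of the series.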
Brandon Bertelsen
  • Thank you, but if I am not mistaken, the tests in this package all assume normality, which is the issue. I am more interested in detecting outliers in ZIP, ZINB, and similar distributions. Thanks – Joe Gomphus Apr 19 '13 at 20:14
  • χ² is non-parametric, so it is less powerful but still considered robust. I don't work with ZIP or ZINB frequently, so I recommend you ask on CV (Cross Validated). – Brandon Bertelsen Apr 19 '13 at 20:25