1

I am trying to apply the expectation-maximization algorithm to estimate missing count data but all the packages in R, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?

Say we have data that look like this:

x <- c(100,  96,  79, 109, 111,  NA,  93,  95, 119,  90, 121,  96,  NA,  
       NA,  85,  95, 110,  97,  87, 104, 101,  87,  87,  NA,  89,  NA, 
       113,  NA,  95,  NA, 119, 115,  NA, 105,  NA,  80,  90, 108,  90,  
       99, 111,  93,  99,  NA,  87,  89,  87, 126, 101, 106)

Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer but the data are not continuous but discrete.

I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
Constantin
  • 132
  • 9
  • Wouldn't you expect that the result of an imputation for count data would be "discrete"? After all you would in many cases be submitting that output to functions which will at the very least be giving you warnings if you have non-integer inputs and at worst simply erroring out. – IRTFM Feb 09 '22 at 17:01
  • I apologize but I am a little confused with your comment. Imputation based on the mean or some other statistic is not doing the same thing as expectation maximization. I wanted to use a more rigorous method to estimate missing values before submitting it to another analysis. – Constantin Feb 10 '22 at 22:08

1 Answers1

1

Defining x0:

x0 <- x[!is.na(x)]

The Jeffreys/reference prior for a Poisson distribution with mean lambda is 1/sqrt(lambda). From the observed values, this results in lambda having a gamma reference posterior with a shape parameter sum(x0) + 0.5 and a rate parameter 1/length(x0). You could take n samples of lambda with:

lambda <- rgamma(n, sum(x0) + 0.5, length(x0))

Then sample n missing values (xm) with

xm <- rpois(n, lambda)

Alternatively, since a Gamma-Poisson compound distribution can be formulated as a negative binomial (after integrating out lambda):

xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))

As a function:

MI_poisson <- function(x, n) {
  x0 <- x[!is.na(x)]
  rbind(matrix(x0, ncol = n, nrow = length(x0)),
        matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5, length(x0)/(length(x0) + 1L)), ncol = n))
}

This will return a matrix with n columns where each column contains the original vector x with all NA values imputed. Each column could be used separately in further analysis, then the results can be aggregated.

jblood94
  • 10,340
  • 1
  • 10
  • 15
  • 2
    You do realize this is not EM imputation yet the question clearly talks of EM imputation. – Onyambu Feb 09 '22 at 17:48
  • @Onyambu wouldn't EM result in the mean of the non-NA values since it's a single sample space and the ML is the mean? – jblood94 Feb 09 '22 at 19:23
  • The issue is that this method is not capturing the inherent stochasticity present in the missing data. This is why I wanted to use EM. Is it that difficult to adapt the EM algorithm to an exponential distribution that is not normal? – Constantin Feb 10 '22 at 22:10
  • Are you looking to get a single value to replace all the `NA` values (single imputation), or do you want each `NA` to be replaced independently with a random value (multiple imputation)? – jblood94 Feb 11 '22 at 15:03
  • I would need to have each NA replaced independently with a random value, so multiple imputations I guess. – Constantin Feb 11 '22 at 15:33
  • 1
    My understanding is that EM is used for single imputation. See the modified answer for a multiple imputation solution. – jblood94 Feb 11 '22 at 15:59
  • @jblood94 this is really helpful, thank you. How would I do this using expectation maximization like in this question https://stats.stackexchange.com/questions/405526/em-algorithm-for-poisson-gamma ? I think I may be confused about the difference between imputation and impute.EM is actually doing? – Constantin Feb 11 '22 at 19:44
  • For that question, the dataset is known to be Poisson-distributed, but the rate parameter is known only to be a random variate from a gamma distribution with a known shape parameter and unknown scale parameter. The question is deriving the EM algorithm for finding the unknown gamma scale parameter. For your question, the unknown parameter is the Poisson rate parameter. The EM algorithm will converge to the mean of the observed data points. – jblood94 Feb 11 '22 at 21:12
  • You could use that value from the EM algorithm to sample for the `NA` values with `rpois`, but that will not take into account uncertainty about the rate parameter. The solution above does. – jblood94 Feb 11 '22 at 21:14
  • Great, thank you for your help. I have accepted your answer. – Constantin Feb 16 '22 at 17:39