
Suppose I have some data and I fit them to a gamma distribution. How do I find the interval probability Pr(1 < x <= 1.5), where x is an out-of-sample data point?

require(fitdistrplus)

a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)

fit <- fitdist(a, "gamma", lower = c(0, 0))
– Yang Yang

3 Answers


Here is an example that uses MCMC techniques and a Bayesian mode of inference to estimate the posterior probability that a new observation falls in the interval (1, 1.5]. This is an unconditional estimate, as opposed to the conditional estimate obtained by integrating the gamma distribution at the maximum-likelihood parameter estimates.

This code requires that JAGS be installed on your computer (free and easy to install).

library(rjags)

a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
       0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
       0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
       1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
       0.0475603,1.8657773,0.18307362,1.13519144)

# Specify the model in JAGS language using diffuse priors for shape and scale
sink("GammaModel.txt")
cat("model{

    # Priors
    shape ~ dgamma(.001,.001)
    rate ~ dgamma(.001,.001)

    # Model structure
    for(i in 1:n){
    a[i] ~ dgamma(shape, rate)
    }
    }
    ", fill=TRUE)
sink()

jags.data <- list(a=a, n=length(a))

# Give overdispersed initial values (not important for this simple model, but very
# important for complicated models where you need to check convergence by monitoring
# multiple chains)
inits <- function(){list(shape=runif(1,0,10), rate=runif(1,0,10))}

# Specify which parameters to monitor
params <- c("shape", "rate")

# Set-up for MCMC run
nc <- 1   # number of chains
n.adapt <- 1000   # number of adaptation steps
n.burn <- 1000    # number of burn-in steps
n.iter <- 500000  # number of sampling iterations (with thin = 10, this keeps 50000 draws)
thin <- 10   # thinning of posterior samples

# Running the model
gamma_mod <- jags.model('GammaModel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(gamma_mod, n.burn)
gamma_samples <- coda.samples(gamma_mod,params,n.iter=n.iter, thin=thin)
# Summarize the result
summary(gamma_samples)
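
As an aside (not part of the original answer): if you do run multiple chains as the initial-values comment suggests (e.g. nc <- 3), convergence can be checked with coda's Gelman-Rubin diagnostic; rjags loads coda automatically:

## potential scale reduction factors near 1 indicate convergence (requires nc > 1)
gelman.diag(gamma_samples)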

# Draw posterior predictive samples for a new observation x (one draw per retained
# posterior sample; index columns by name to avoid relying on coda's alphabetical
# column ordering)
x <- rep(NA, 50000)
for(i in 1:50000){
  x[i] <- rgamma(1, gamma_samples[[1]][i, "shape"], rate = gamma_samples[[1]][i, "rate"])
}

# Compute the proportion of posterior predictive draws that fall in the desired range.
length(which(x > 1 & x <= 1.5))/length(x)

Answer: Pr(1 < x <= 1.5) = 0.194. So it is pretty close to the conditional estimate, but this is not guaranteed to be the case in general.

– Jacob Socolar
  • @ZheyuanLi You're right, the vectorized way is better. My reason for looping is because I wasn't certain that R's RNGs were vectorized in that way, and because I wasn't concerned about efficiency (so I didn't check). As for rgamma vs pgamma, the rgamma approach is a one-size-fits all to getting full posterior distributions of derived parameters from MCMC output (for any arbitrary function of parameters, just calculate it over the full set of posterior draws). Using pgamma would give us a bunch of posterior estimates of the probability of falling in the interval, but we'd still have to... – Jacob Socolar Dec 08 '16 at 00:23
  • ... think about how to aggregate them into a single estimate (which is easy to do, but is an extra bit of thinking for a lazy analyst who is automatically familiar with the first approach). – Jacob Socolar Dec 08 '16 at 00:24
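
For reference, here is a minimal sketch of the vectorized alternative discussed in the comments above (assuming the gamma_samples object from the JAGS run; the helper matrix draws is my addition). rgamma and pgamma are vectorized over their shape and rate arguments, so the loop is not needed:

draws <- as.matrix(gamma_samples[[1]])  ## posterior draws; columns "shape" and "rate"

## vectorized posterior predictive draws
x <- rgamma(nrow(draws), shape = draws[, "shape"], rate = draws[, "rate"])
mean(x > 1 & x <= 1.5)

## pgamma route: per-draw interval probabilities, then average over the posterior
p <- pgamma(1.5, draws[, "shape"], draws[, "rate"]) -
  pgamma(1, draws[, "shape"], draws[, "rate"])
mean(p)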

Someone does not like my other answer below, which is conditional on the MLE; now let's see something unconditional. Direct integration would require a triple integral: one for shape, one for rate, and finally one for x. That is not appealing, so I will just produce a Monte Carlo estimate instead.

By the Central Limit Theorem, the MLE is asymptotically normally distributed. fitdistrplus::fitdist does not give the standard errors directly, but we can use MASS::fitdistr, which performs the same inference and returns the full variance-covariance matrix.

library(MASS)

fit <- fitdistr(a, "gamma", lower = c(0, 0))

b <- fit$estimate
#   shape     rate 
#1.739737 1.816134 

V <- fit$vcov  ## variance-covariance matrix of the MLE
#          shape      rate
#shape 0.1423679 0.1486193
#rate  0.1486193 0.2078086

Now we would like to sample from the parameter distribution and get samples of the target probability.

set.seed(0)
## sample from bivariate normal with mean `b` and covariance `V`
## Cholesky method is used here
X <- matrix(rnorm(1000 * 2), 1000)  ## 1000 `N(0, 1)` normal samples
R <- chol(V)  ## upper triangular Cholesky factor of `V`
X <- X %*% R  ## transform X under desired covariance
X <- X + b  ## shift to desired mean
## you can use `cov(X)` to check it is very close to `V`
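
Equivalently, MASS::mvrnorm wraps the whole sample-and-transform step in a single call (a sketch, not part of the original answer; MASS is loaded above for fitdistr):

## draw 1000 samples from N(b, V) in one call
X <- MASS::mvrnorm(1000, mu = b, Sigma = V)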

## now samples of `Pr(1 < x <= 1.5)`
p <- pgamma(1.5, X[,1], X[,2]) - pgamma(1, X[,1], X[,2])

We can make a histogram of p (and maybe do a density estimation if you want):

hist(p, prob = TRUE)

[Figure: histogram of the sampled interval probabilities p]
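
For the density estimate mentioned above, base R's density() works directly on the sampled p (a small sketch, not in the original answer):

lines(density(p), lwd = 2)  ## overlay a kernel density estimate on the histogram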

Now, we often want the sample mean as a point estimate:

mean(p)
# [1] 0.1906975
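
The spread of p also quantifies how much parameter uncertainty propagates into the interval probability; for example, an approximate 95% interval (a small addition, not in the original answer):

quantile(p, c(0.025, 0.975))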
– Zheyuan Li
  • Nice! How safe is the multivariate-normal approximation to the posterior distribution of the parameters in this case? (+1) – Jacob Socolar Dec 07 '16 at 22:48
  • Nice method, but I notice that `fitdistrplus::fitdist` does provide the standard errors of the maximum-likelihood parameters. – Yang Yang Dec 07 '16 at 23:03

You can just use pgamma with the estimated parameters in fit.

b <- fit$estimate
#   shape     rate 
#1.739679 1.815995 

pgamma(1.5, b[1], b[2]) - pgamma(1, b[1], b[2])
# [1] 0.1896032

Thanks. But how about Pr(x > 2)?

Check out the lower.tail argument:

pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)

By default, pgamma(q) evaluates Pr(x <= q). Setting lower.tail = FALSE gives Pr(x > q). So you can do:

pgamma(2, b[1], b[2], lower.tail = FALSE)
# [1] 0.08935687

Or you can also use

1 - pgamma(2, b[1], b[2])
# [1] 0.08935687
– Zheyuan Li
  • You are amazing! `lower.tail = TRUE` is very useful. Thanks a lot. – Yang Yang Dec 07 '16 at 21:56
  • I disagree with this approach, which neglects posterior uncertainty around the maximum-likelihood estimates for the parameters of the gamma distribution. That is, this approach gives the right answer if the fitted parameter values are exactly correct. But that is assuredly not the case. Instead, we should integrate the probability calculated here over the full joint posterior distribution of the shape and rate parameters. I can code up a quick example of how to do this if you're interested, but it might use some unfamiliar packages/statistical fitting techniques. – Jacob Socolar Dec 07 '16 at 22:01
  • @ZheyuanLi Yes, you are right. `TRUE` stands for `<` and it is the classic way to talk about the probability. I wonder if you are familiar with Principal component analysis, I have another question on this topic at http://stackoverflow.com/questions/41022927/principal-component-analysis-pca-of-time-series-data-spatial-and-temporal-pat. Could you help me as well? Thanks a lot. – Yang Yang Dec 07 '16 at 22:02
  • @ZheyuanLi Sorry if I came across as suggesting you didn't know how integration works. I agree the uncertainty of the estimate is known, but you and I both agree that this uncertainty in parameter estimates isn't incorporated to the estimate of probability `Pr(1 < x <= 1.5)` unless we do something a bit fancy. I wouldn't necessarily label myself a Bayesian, but in cases like this one I enjoy the ease of uncertainty propagation that comes with MCMC techniques. – Jacob Socolar Dec 07 '16 at 22:43