
I'm trying to use PyMC 2.3 to obtain an estimate of the parameter of a compound model. By "compound" I mean a random variable that follows a distribution whose parameter is itself another random variable. ("Nested" or "hierarchical" are sometimes used to refer to this situation, but I think they are less specific and generate more confusion in this context.)

Let me give a specific example. The "real" data is generated from a compound distribution: a Poisson whose rate parameter is distributed as an Exponential. In plain scipy the data is generated as follows:

import numpy as np
from scipy.stats import distributions
np.random.seed(3)  # for repeatability

nsamples = 1000
tau_true = 50
orig_lambda_sample = distributions.expon(scale=tau_true).rvs(nsamples)
data = distributions.poisson(orig_lambda_sample).rvs(nsamples)

I want to obtain an estimate of the model parameter tau_true. My approach so far in modelling this problem in PyMC is the following:

import pymc as pm

tau = pm.Uniform('tau', 0, 100)  # flat prior on the exponential mean
# one latent Poisson rate per observation, exponentially distributed with mean tau
count_rates = pm.Exponential('count_rates', beta=1/tau, size=nsamples)
counts = pm.Poisson('counts', mu=count_rates, value=data, observed=True)

Note that I use size=nsamples to have a new stochastic variable for each sample.

Finally I run the MCMC as:

model = pm.Model([count_rates, counts, tau])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)

The model converges (although slowly, requiring > 10^5 iterations) to a distribution centred around 50 (tau_true). However, it seems like overkill to define 1000 stochastic variables just to fit a single distribution with a single parameter.

Is there a better way to describe a compound model in PyMC?

PS I also tried with a more informative prior (tau = pm.Normal('tau', mu=51, tau=1/2**2)) but the results are similar and the convergence is still slow.

user2304916

1 Answer


It looks like what you are trying to model is data that is over-dispersed. In fact, a negative binomial distribution is just a Poisson whose mean is distributed according to a gamma distribution (of which the exponential is a special case). So, one way to get around defining 1000 variables is to use the negative binomial directly. Keep in mind, though, that despite there being nominally 1000 variables, the effective number of variables is somewhere between 1 and 1000, depending on how constrained the distribution of means is. You are essentially defining a random effect here.
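As a quick sanity check of this equivalence (using scipy only, no PyMC), mixing a Poisson over an exponentially distributed rate with mean tau gives a negative binomial with n=1 (i.e. a geometric distribution) and p = 1/(1+tau). This is a minimal numerical sketch, not part of the original question's code:

```python
import numpy as np
from scipy.stats import distributions

np.random.seed(3)
nsamples = 100000
tau = 50.0

# compound model: Poisson with exponentially distributed rate (mean tau)
lam = distributions.expon(scale=tau).rvs(nsamples)
compound = distributions.poisson(lam).rvs(nsamples)

# closed form: negative binomial with n=1 and p = 1/(1+tau)
nb = distributions.nbinom(1, 1.0 / (1.0 + tau)).rvs(nsamples)

# both should have mean ~ tau and variance ~ tau + tau**2
print(compound.mean(), nb.mean())
print(compound.var(), nb.var())
```

In PyMC 2.3 this should correspond (assuming the mu/alpha parameterization of `pm.NegativeBinomial`, where the variance is mu + mu**2/alpha) to something like `counts = pm.NegativeBinomial('counts', mu=tau, alpha=1, value=data, observed=True)`, replacing the 1000 latent rate variables with a single observed stochastic.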

Chris Fonnesbeck
  • Thanks for confirming this is the right approach in PyMC. Yes, I know this is just a negative binomial distribution. But my question is general on how to define a compound variable. In the real case, the first distribution is not a pure gamma (is truncated). – user2304916 Oct 14 '14 at 22:13
  • Still, the slow (almost non-existent) convergence is a show stopper (unless using the negative binomial). If this is the case, maybe MCMC Bayesian inference is not the right tool for compound models when there are 1000s of data points. What do you think? – user2304916 Oct 14 '14 at 22:18
  • MCMC scales well with model size (many parameters) but not as well with data size (many observations). If you have a ton of observations, convergence should not be an issue (at least for a simple model), but speed might be. If you are just using a simple model like that, you can try a non-iterative approximation approach, such as MAP and NormApprox. – Chris Fonnesbeck Oct 16 '14 at 16:01
  • Also, in the development version (PyMC 3) there is a Hamiltonian MC method that samples more efficiently from large models. However, the new version is a breaking update from PyMC 2.3, and it is not yet well documented, so you'd have to follow the examples. – Chris Fonnesbeck Oct 16 '14 at 16:02