
I am trying to use PyMC to take four different predictors of a medical condition as input and combine them into an overall posterior probability that a patient has the condition, given the subset of predictors that say "yes, this patient has this condition".

The idea is to draw a theta (the overall rate of the condition) and a false negative and false positive rate for each predictor from beta distributions, and from those calculate a marginal probability and a posterior probability using Bayes' theorem. I have an array of 16 observed counts, one for every possible combination of predictors (with 4 predictors there are 2**4 = 16 combinations). I am feeding this set of counts and the marginal probabilities into a multinomial distribution, analogous to how disasters_array is used with a Poisson distribution in the PyMC tutorial: http://pymc-devs.github.io/pymc/tutorial.html.
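In other words, for a single combination of predictor outputs, the marginal probability I'm after is the denominator of Bayes' theorem. Here's a plain-Python sketch of the arithmetic I have in mind (the rates here are made up for illustration; in the actual model they are drawn from beta distributions):

def marg_prob(x, theta, fn, fp):
    """Marginal probability of seeing detection pattern x (a list of 0/1 flags),
    given the condition rate theta and per-predictor false negative/positive rates."""
    tp = [1 - f for f in fn]  # true positive rate = 1 - false negative rate
    tn = [1 - f for f in fp]  # true negative rate = 1 - false positive rate
    p_given_cond = 1.0        # P(pattern | has condition)
    p_given_no_cond = 1.0     # P(pattern | no condition)
    for xi, fni, tpi, fpi, tni in zip(x, fn, tp, fp, tn):
        p_given_cond *= tpi if xi else fni
        p_given_no_cond *= fpi if xi else tni
    return theta * p_given_cond + (1 - theta) * p_given_no_cond

# e.g. predictors 1 and 3 said "yes", with illustrative rates
print(marg_prob([1, 0, 1, 0], theta=0.01, fn=[0.1] * 4, fp=[0.05] * 4))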

Here's the code I wrote to try and do this:

from pymc import Multinomial, Beta, deterministic
from numpy import array

n = 4 # number of predictors

counts_array = array([2942808473, 17491655, 21576, 23189, 339805, 89159, 168214, 76044, 43138288, 530963, 22682, 22169, 462052, 129472, 2804257, 3454104]) # 16 counts - one count value for each possible permutation of predictors that detected medical condition
pred = array([[0,0,0,0],[1,0,0,0],[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,0],[0,1,1,0],[1,1,1,0],[0,0,0,1],[1,0,0,1],[0,1,0,1],[1,1,0,1],[0,0,1,1],[1,0,1,1],[0,1,1,1],[1,1,1,1]]) # array of yes/no's from predictors for each value in counts_array above

theta = Beta('theta',1,2)

fn = fp = tn = tp = [0] * 4
for i in range(0,n):
    fn[i] = Beta('fn' + str(i),1,2)
    fp[i] = Beta('fp' + str(i),1,2)
    tn[i] = 1 - fp[i]
    tp[i] = 1 - fn[i]

@deterministic(plot=False)
def margprobs():
    mp = [0] * 2**n  # initialize with vector of 2**n zeros
    for i in range(0,2**n):
        mp[i] = theta *\
            (fn[0]**(1-pred[i][0])) * (tp[0]**pred[i][0]) *\
            (fn[1]**(1-pred[i][1])) * (tp[1]**pred[i][1]) *\
            (fn[2]**(1-pred[i][2])) * (tp[2]**pred[i][2]) *\
            (fn[3]**(1-pred[i][3])) * (tp[3]**pred[i][3])\
            + (1-theta) *\
            (tn[0]**(1-pred[i][0])) * (fp[0]**pred[i][0]) *\
            (tn[1]**(1-pred[i][1])) * (fp[1]**pred[i][1]) *\
            (tn[2]**(1-pred[i][2])) * (fp[2]**pred[i][2]) *\
            (tn[3]**(1-pred[i][3])) * (fp[3]**pred[i][3])
    return mp

@deterministic(plot=False)
def postprobs():
    pp = [0] * 2**n  # initialize with vector of 2**n zeros
    for i in range(0,2**n):
        pp[i] = theta *\
            (fn[0]**(1-pred[i][0])) * (tp[0]**pred[i][0]) *\
            (fn[1]**(1-pred[i][1])) * (tp[1]**pred[i][1]) *\
            (fn[2]**(1-pred[i][2])) * (tp[2]**pred[i][2]) *\
            (fn[3]**(1-pred[i][3])) * (tp[3]**pred[i][3])\
            / margprobs[i]
    return pp

counts = Multinomial(name="counts", value=counts_array, n=2**n, p=margprobs, observed=True)

When I run this, I get an error to do with the last line, when calculating counts:

$ python test.py
Traceback (most recent call last):
  File "test.py", line 46, in <module>
    counts = Multinomial(name="counts", value=counts_array, n=2**n, p=margprobs, observed=True)
  File "/Users/jtr4v/anaconda/lib/python2.7/site-packages/pymc/distributions.py", line 3269, in __init__
    verbose=verbose, **kwds)
  File "/Users/jtr4v/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.py", line 772, in __init__
    if not isinstance(self.logp, float):
  File "/Users/jtr4v/anaconda/lib/python2.7/site-packages/pymc/PyMCObjects.py", line 929, in get_logp
    raise ZeroProbability(self.errmsg)
pymc.Node.ZeroProbability: Stochastic counts's value is outside its support,
 or it forbids its parents' current values

Obviously this error is to do with PyMC not liking some value I'm feeding to Multinomial(), but I'm not sure which one is wrong. I believe value should be counts_array (my observed counts), n should be 16 since I want an array of 16 counts, one for each possible combination of predictors, p should be my marginal probabilities, and observed should be True since these are observed values.
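Restating that as code comments (these are just my assumptions about what Multinomial expects, which may well be where I'm going wrong):

counts = Multinomial(
    name="counts",
    value=counts_array,  # the 16 observed counts, one per predictor combination
    n=2**n,              # my assumption: the number of categories (16) -- or should it be the total number of observations?
    p=margprobs,         # my assumption: the per-category probabilities (do these need to sum to 1?)
    observed=True)       # the counts are data, not something to be sampled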

What am I doing wrong?

Edit: If it helps, I was previously doing this in R2jags with the following code:

model {
    theta ~ dbeta(1,2); # draw theta from a beta distribution
    for (i in 1:N) { # draw a false positive and false negative rate for each of N predictors
        fp[i] ~ dbeta(1,2);
        fn[i] ~ dbeta(1,2);
        tp[i] <- 1-fn[i]; # true positive = 1 - false negative rate
        tn[i] <- 1-fp[i]; # true negative rate = 1 - false positive rate
    }
    for (j in 1:M) {
    # Bayes' theorem, i.e.
    # posterior probability =
    #   P(A) * P(B|A)
    #   -------------------------------
    #   P(A) * P(B|A) + P(-A) * P(B|-A)   <--- the denominator is the marginal probability
    #
    # x[j,] is a vector of 1's and 0's indicating whether each predictor said yes or no
    margprobs[j] <- (theta *
                        (fn[1]^(1-x[j,1])) * (tp[1]^x[j,1]) *
                        (fn[2]^(1-x[j,2])) * (tp[2]^x[j,2]) *
                        (fn[3]^(1-x[j,3])) * (tp[3]^x[j,3]) *
                        (fn[4]^(1-x[j,4])) * (tp[4]^x[j,4])
                     + (1-theta) *
                        (tn[1]^(1-x[j,1])) * (fp[1]^x[j,1]) *
                        (tn[2]^(1-x[j,2])) * (fp[2]^x[j,2]) *
                        (tn[3]^(1-x[j,3])) * (fp[3]^x[j,3]) *
                        (tn[4]^(1-x[j,4])) * (fp[4]^x[j,4]));

    postprobs[j] <- theta *
                        (fn[1]^(1-x[j,1])) * (tp[1]^x[j,1]) *
                        (fn[2]^(1-x[j,2])) * (tp[2]^x[j,2]) *
                        (fn[3]^(1-x[j,3])) * (tp[3]^x[j,3]) *
                        (fn[4]^(1-x[j,4])) * (tp[4]^x[j,4])
                        / margprobs[j];

    }
    counts ~ dmulti(margprobs, total);
}
slattering
  • Is it possible to fix `margprobs` to some known value to see whether invalid probabilities are being passed somehow? – jbaums Nov 18 '14 at 09:49
  • Good idea. When I set p to a vector of 16 x 0.5's (instead of margprobs), I get the same error ("Stochastic counts's value is outside its support..."). I guess this might mean that the problem is the counts_array values? – slattering Nov 18 '14 at 13:26
  • I'm pretty sure `n` is the number of trials that were performed, from which you observed outcomes according to `value`. It seems to me that you're saying you observed outcome one 2942808473 times, outcome two 17491655 times, and so on, from only 16 trials. I think the elements of `value` should sum to `n`. – jbaums Nov 18 '14 at 17:42
  • Thanks jbaums, good idea. When I set value=counts_array, n=int(sum(counts_array)), p=margprobs, I still get that same error. "Stochastic counts's value is outside its support, or it forbids its parents' current values"... – slattering Nov 19 '14 at 07:23
  • Do the `p` need to sum to 1, or are they treated as relative weights and normalised by pymc? – jbaums Nov 19 '14 at 07:39
  • Not sure what PyMC expects, but the values returned by margprobs probably are not going to sum to 1. Do they need to? – slattering Nov 20 '14 at 08:00
  • No idea... I tried to install PyMC so I could test it at my end, but ran into errors during installation that I didn't have the patience to resolve. However, looking at the error message again, the `ZeroProbability` error perhaps implies that one or more elements of `margprobs` is 0 (although this should have been avoided when you set all `p` to 0.5). Maybe try setting all `p` to 1/16? Sorry I can't be of more help - I don't have experience with PyMC... just trying to think of ways to debug. – jbaums Nov 20 '14 at 08:04
  • Thanks jbaums, it's helpful to have some suggestions to try to sort out the issue here. BTW, on my MacBook the only way I could install PyMC was this: conda install -c https://conda.binstar.org/tobeplugged pymc – slattering Nov 20 '14 at 11:46
  • No dice, when I set margprobs to [1.0/16]*2**4, I'm getting the same old error, "Stochastic counts's value is outside its support..." Worth a try though – slattering Nov 20 '14 at 11:51
  • Judging from this question I think that p does indeed need to sum to 1: http://stackoverflow.com/questions/23879562/multinomial-distribution-in-pymc - funny thing, when I substitute my counts_array into count in the example in that link, I get the same error "value outside of support". I think PyMC maybe just doesn't like numbers as high as those in my counts_array. When I make the large counts in my array smaller, the example in that link works fine... – slattering Nov 20 '14 at 13:39
  • Yep, the values in counts_array are too big. When I make the values smaller, my code above runs okay. I wonder if it's statistically valid for me to downsample my counts so that PyMC is okay with my counts_array? (I've put a sketch of what I mean just after these comments.) – slattering Nov 20 '14 at 13:44
  • I imagine so - the parameters will be estimated based on relative counts, won't they? But you could see how sensitive the estimates are to changes in the magnitude of counts. – jbaums Nov 20 '14 at 14:14
  • Thanks again jbaums. Think I might have a look at PyStan and see if it can deal with my real counts data... – slattering Nov 21 '14 at 16:50
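For reference, the downscaling workaround mentioned in the comments above looks roughly like this (the target total of 10,000 is an arbitrary number I picked, and I still don't know whether shrinking the counts like this is statistically valid):

# Shrink the observed counts so their total is small enough for PyMC to handle,
# keeping the relative proportions (approximately) the same.
scale = 10000.0 / counts_array.sum()
small_counts = (counts_array * scale).round().astype(int)

counts = Multinomial(name="counts", value=small_counts,
                     n=int(small_counts.sum()), p=margprobs, observed=True)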
