
I am implementing the derivation of a Dirichlet-Multinomial posterior from a paper I've read, and I am having trouble getting the distribution to sum to 1. Here is the code in its unsimplified form:

import math
import numpy as np
from scipy.special import gamma as G  # G is the Gamma function

def pcn(X, n, N, c, alpha):
    # multinomial likelihood P(n|c)
    pnc = math.factorial(int(np.sum(n)))/ \
          np.prod([math.factorial(int(n[i])) for i in range(len(n))])* \
          np.prod([c[i]**n[i] for i in range(len(n))])
    # Dirichlet prior P(c)
    pc = G(len(X)*alpha)/ \
         np.prod([G(alpha) for i in range(len(n)) if i in X])* \
         np.prod([(c[i])**(alpha - 1) for i in range(len(n)) if i in X])
    # Dirichlet-multinomial evidence P(n)
    pn = math.factorial(int(N))/ \
         np.prod([math.factorial(int(n[i])) for i in range(len(n)) if i in X])* \
         G(len(X)*alpha)/ \
         G(len(X)*alpha + N)* \
         np.prod([G(alpha + n[i])/G(alpha) for i in range(len(n)) if i in X])
    # posterior P(c|n) = P(n|c) * P(c) / P(n) by Bayes' rule
    return pnc * pc / pn

which I've simplified here by removing the parts that cancel out:

def pcns(X, n, N, c, alpha):
    # likelihood terms that survive cancellation
    pnc = np.prod([c[i]**n[i] for i in range(len(n))])

    # prior terms that survive cancellation
    pc = np.prod([(c[i])**(alpha - 1) for i in range(len(n))])/ \
         np.prod([G(alpha) for i in range(len(n))])

    # evidence terms that survive cancellation
    pn = np.prod([G(alpha + n[i])/G(alpha) for i in range(len(n))])/ \
         G(len(X)*alpha + N)

    return pnc * pc / pn

I set the input variables and initialize the array of c values:

X = [0,1]
n = [6.0, 3.0]
N = sum(n)
alpha = 20
c = np.linspace(0, 1, 1000)

and then I iterate over c and evaluate my posterior function at each c and plot:

dist = []
for i in c:
    dist.append(pcns(X, n, N, [i, 1-i], alpha))

plt.plot(c,dist)


The value I get when I sum over dist is 999, i.e. len(c) - 1. Would anyone happen to know why it doesn't sum to 1?

1 Answer


tl;dr: you're computing a discrete approximation to a definite integral and forgot to multiply the pieces by dx ~= delta_x.


Even if this were a correctly normalized probability distribution, why should dist sum to 1? Since we don't have G, I can't reproduce your results exactly, so let's consider a simple Gaussian instead. We can set the mean to 0.5 and the stddev to something small, so that the range from 0 to 1 contains most of the probability:

>>> import scipy.stats
>>> c = np.linspace(0, 1, 1000)
>>> p = scipy.stats.norm(0.5,0.02).pdf(c)
>>> sum(p)
999.00000000000045

And there's that 999 again. But again, why should this be 1? The quantity which should be one is the total integrated probability. Here we're just taking values of the probability distribution function at some points and adding them together.

Easier example: we know the definite integral of x^2 from 0 to 1 is 1/3, but

>>> x = np.linspace(0, 1, 1000)
>>> sum(x**2)
333.50016683350009

when really we want something more like (crude rectangular approximation, a little above the desired answer because we're including a contribution from the x=1 point and that box is actually out of the integral region):

>>> sum(x**2) * (x[1]-x[0])
0.3338340008343344
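If you'd rather not hand-roll the rectangle rule, numpy's trapezoidal-rule helper does the same bookkeeping and handles the endpoint boxes for you (a small sketch of the same x^2 integral):

```python
import numpy as np

x = np.linspace(0, 1, 1000)
# np.trapz applies the trapezoidal rule using the spacing of x,
# so the half-width endpoint boxes are accounted for automatically
approx = np.trapz(x**2, x)
print(approx)  # close to 1/3
```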

Or in your previous situation

>>> sum(p) * (c[1]-c[0])
1.0000000000000004

Note that here we can write * (c[1]-c[0]) because c is equally-spaced and so every "dx" in Int(p(x) dx) ~= sum(p[x] * delta_x) is the same. In general we'd want something more like sum(p[:-1] * np.diff(c)).
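Putting it together with the setup from the question (a sketch, assuming G is scipy.special.gamma, as the asker confirms in the comments below): multiplying the summed posterior values by the grid spacing brings the total to about 1.

```python
import numpy as np
from scipy.special import gamma as G  # assumption: G is the Gamma function

def pcns(X, n, N, c, alpha):
    # simplified posterior from the question
    pnc = np.prod([c[i]**n[i] for i in range(len(n))])
    pc = np.prod([c[i]**(alpha - 1) for i in range(len(n))]) / \
         np.prod([G(alpha) for i in range(len(n))])
    pn = np.prod([G(alpha + n[i]) / G(alpha) for i in range(len(n))]) / \
         G(len(X)*alpha + N)
    return pnc * pc / pn

X = [0, 1]
n = [6.0, 3.0]
N = sum(n)
alpha = 20
c = np.linspace(0, 1, 1000)

dist = [pcns(X, n, N, [ci, 1 - ci], alpha) for ci in c]

# weight each sampled density value by the grid spacing before summing
total = sum(dist) * (c[1] - c[0])
print(total)  # ~1
```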

DSM
  • Great! Thank you for the informative answer. The G function, by the way, is the Gamma function. Sorry, I wrote this while very sleepy. – user5590942 Nov 22 '15 at 16:36