6

I have an integer that needs to be split into bins according to a probability distribution. For example, if I had N=100 objects going into the distribution [0.02, 0.08, 0.16, 0.29, 0.45], then you might get [1, 10, 20, 25, 44].

import numpy as np

# sample distribution: probabilities proportional to 1, 4, 9, 16, 25
d = np.array([x ** 2 for x in range(1, 6)], dtype=float)
d = d / d.sum()
dcs = d.cumsum()          # cumulative distribution

bins = np.zeros(d.shape)
N = 100
for roll in np.random.rand(N):
    # grab the first index whose cumulative probability exceeds the roll
    i = np.where(roll < dcs)[0][0]
    bins[i] += 1

In reality, N and my number of bins are very large, so looping isn't really a viable option. Is there any way I can vectorize this operation to speed it up?

Divakar
Matt

3 Answers

5

You can convert your PDF to a CDF by taking the cumulative sum, use this to define a set of bin edges between 0 and 1, then use those edges to compute the histogram of an N-long uniform random vector:

cdf = np.cumsum([0, 0.02, 0.08, 0.16, 0.29, 0.45])     # leftmost bin edge = 0
counts, edges = np.histogram(np.random.rand(100), bins=cdf)

print(counts)
# [ 4,  8, 16, 30, 42]
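
If you'd rather reuse the question's variables than hardcode the probabilities, a minimal sketch of the same idea (assuming the d, dcs, and N definitions from the question's snippet) might look like:

import numpy as np

# same distribution as in the question
d = np.array([x ** 2 for x in range(1, 6)], dtype=float)
d = d / d.sum()
dcs = d.cumsum()
N = 100

edges = np.concatenate(([0.0], dcs))   # prepend 0 so the first bin starts at 0
counts, _ = np.histogram(np.random.rand(N), bins=edges)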
ali_m
  • Of course, I use the term binning and neglect to look up histograms. Thanks a ton! – Matt Jul 30 '15 at 20:17
  • No worries. I've taken the liberty of editing the title of your question in order to make it easier for others to find. – ali_m Jul 30 '15 at 20:33
2

You can use np.bincount for the binning operation, along with np.searchsorted to perform the equivalent of the roll < dcs step. Here's an implementation -

bins = np.bincount(np.searchsorted(dcs, np.random.rand(N), 'right'))
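
One caveat worth noting (an addition, not part of the original answer): np.bincount only sizes its output up to the largest index it actually sees, so trailing bins that receive no samples would be silently dropped. Passing minlength guards against that, assuming dcs and N as defined in the question:

bins = np.bincount(np.searchsorted(dcs, np.random.rand(N), 'right'),
                   minlength=len(dcs))  # keep a slot for every bin, even empty ones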

Runtime test using given parameters -

In [72]: %%timeit
    ...: for roll in np.random.rand(N):
    ...:     # grab the first index that the roll satisfies
    ...:     i = np.where(roll < dcs)[0][0]  
    ...:     bins[i] += 1
    ...: 
1000 loops, best of 3: 721 µs per loop

In [73]: %%timeit
    ...: np.bincount(np.searchsorted(dcs,np.random.rand(N),'right'))
    ...: 
100000 loops, best of 3: 13.5 µs per loop
Divakar
  • Your solution is about a factor of 10 faster than the other for smaller system sizes, but becomes a factor of 10 slower for larger ones. Neat, thanks for the alternative solution! – Matt Aug 05 '15 at 19:06
0

One more way to do this:

import numpy as np

p = [0.02, 0.08, 0.16, 0.29, 0.45]
np.bincount(np.random.choice(range(len(p)), size=100, p=p), minlength=len(p))
# array([ 1,  6, 16, 25, 52])

It seems like allocating a length-100 array should be unnecessary but I haven't seen a way in numpy to avoid it.
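
For what it's worth, np.random.multinomial draws the per-bin counts directly, which looks like it sidesteps the intermediate length-100 array (a sketch along the same lines, not part of the original answer):

import numpy as np

p = [0.02, 0.08, 0.16, 0.29, 0.45]
# one multinomial draw returns all five bin counts at once
counts = np.random.multinomial(100, p)
print(counts)
# e.g. array([ 3,  8, 15, 31, 43])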

Sophie Alpert