
I am currently using NumPy for the following task: I have a large grid of values and I need to take a multinomial sample at each point. The probability vector for the multinomial will vary from grid site to grid site, so the NumPy multinomial function doesn't quite work for me since it does all of its draws from the same distribution. Iterating over each site seems incredibly inefficient and I was wondering if there was a way to do this efficiently in NumPy. It appears that something like this is possible (and fast) if I use Theano (see this answer), but that would require quite a bit of rewriting, which I would ideally like to avoid. Can multinomial sampling be vectorized efficiently in basic NumPy?
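For reference, the per-site loop I am trying to avoid looks something like this (a minimal sketch; the grid shape and counts are illustrative):

```python
import numpy as np

# a small grid of per-site probability vectors: shape (rows, cols, k)
p = np.random.dirichlet(np.ones(5), size=(20, 30))

# one multinomial draw per site, iterating over every grid point
sample = np.empty(p.shape, dtype=int)
for i in range(p.shape[0]):
    for j in range(p.shape[1]):
        sample[i, j] = np.random.multinomial(1000, p[i, j])
```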

Edit: It is easy to modify Warren's code to allow different counts at different sites, as I have found I need: all one needs to do is pass in the full count array and remove the first line of the function, like so:

import numpy as np


def multinomial_rvs(count, p):
    """
    Sample from the multinomial distribution with multiple p vectors.

    * count must be a numpy integer array broadcastable to p.shape[:-1]
      (one count per site).
    * p must be a numpy array with at least one dimension.  The last axis of p
      holds the sequence of probabilities for a multinomial distribution.

    The return value has the same shape as p.
    """
    count = np.array(count, copy=True)  # work on a copy; don't mutate the caller's array
    out = np.zeros(p.shape, dtype=int)
    ps = p.cumsum(axis=-1)
    # Conditional probabilities
    with np.errstate(divide='ignore', invalid='ignore'):
        condp = p / ps
    condp[np.isnan(condp)] = 0.0
    for i in range(p.shape[-1]-1, 0, -1):
        binsample = np.random.binomial(count, condp[..., i])
        out[..., i] = binsample
        count -= binsample
    out[..., 0] = count
    return out
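For example (repeating the function here so the snippet runs on its own; the copy of count is a defensive addition so the caller's array is not destroyed by the in-place subtraction):

```python
import numpy as np

def multinomial_rvs(count, p):
    # same function as above, with a defensive copy of count added
    count = np.array(count, copy=True)
    out = np.zeros(p.shape, dtype=int)
    ps = p.cumsum(axis=-1)
    with np.errstate(divide='ignore', invalid='ignore'):
        condp = p / ps
    condp[np.isnan(condp)] = 0.0
    for i in range(p.shape[-1] - 1, 0, -1):
        binsample = np.random.binomial(count, condp[..., i])
        out[..., i] = binsample
        count -= binsample
    out[..., 0] = count
    return out

# a different total count at each of the six grid sites
count = np.array([[ 10, 500, 1000],
                  [250,  50,    5]])
p = np.full((2, 3, 4), 0.25)       # uniform 4-category distribution at every site
sample = multinomial_rvs(count, p)
print(sample.sum(axis=-1))         # equals count at every site
```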
algol
  • What is the dimension of the multinomial distribution that you are interested in? That is, in terms of the [wikipedia article on the multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution), what is k? – Warren Weckesser Apr 24 '19 at 12:58
  • For my application k=5 for each array site. The p vector is of the form: p = [p_s, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4], with p_s varying from site to site. – algol Apr 24 '19 at 19:27

2 Answers


Here's one way you can do it. It is not fully vectorized, but the only Python loop is over the last axis of p (the categories). If your p vectors are not too long, this might be sufficiently fast for you.

The multinomial distribution is implemented using repeated calls to np.random.binomial, which broadcasts its arguments. The idea is to draw the categories from last to first: conditional on the draws already made for the later categories, the count in category i is binomial with the remaining count and probability p[i] / (p[0] + ... + p[i]).

import numpy as np


def multinomial_rvs(n, p):
    """
    Sample from the multinomial distribution with multiple p vectors.

    * n must be a scalar (the number of trials).
    * p must be a numpy array with at least one dimension.  The last axis of p
      holds the sequence of probabilities for a multinomial distribution.

    The return value has the same shape as p.
    """
    count = np.full(p.shape[:-1], n)
    out = np.zeros(p.shape, dtype=int)
    ps = p.cumsum(axis=-1)
    # Conditional probabilities
    with np.errstate(divide='ignore', invalid='ignore'):
        condp = p / ps
    condp[np.isnan(condp)] = 0.0
    for i in range(p.shape[-1]-1, 0, -1):
        binsample = np.random.binomial(count, condp[..., i])
        out[..., i] = binsample
        count -= binsample
    out[..., 0] = count
    return out

Here's an example where the "grid" has shape (2, 3), and the multinomial distribution is four dimensional (i.e. each p vector has length 4).

In [182]: p = np.array([[[0.25, 0.25, 0.25, 0.25], 
     ...:                [0.01, 0.02, 0.03, 0.94], 
     ...:                [0.75, 0.15, 0.05, 0.05]], 
     ...:               [[0.01, 0.99, 0.00, 0.00], 
     ...:                [1.00, 0.00, 0.00, 0.00], 
     ...:                [0.00, 0.25, 0.25, 0.50]]])                                                                                                 

In [183]: sample = multinomial_rvs(1000, p)                                     

In [184]: sample                                                                
Out[184]: 
array([[[ 249,  260,  233,  258],
        [   3,   21,   33,  943],
        [ 766,  131,   55,   48]],

       [[   5,  995,    0,    0],
        [1000,    0,    0,    0],
        [   0,  273,  243,  484]]])

In [185]: sample.sum(axis=-1)                                                   
Out[185]: 
array([[1000, 1000, 1000],
       [1000, 1000, 1000]])

In a comment, you said "The p vector is of the form: p = [p_s, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4], with p_s varying from site to site." Here's how you can use the above function, given an array that contains the p_s values.

First create some data for the example:

In [73]: p_s = np.random.beta(4, 2, size=(2, 3))                                                                                                        

In [74]: p_s                                                                                                                                            
Out[74]: 
array([[0.61662208, 0.6072323 , 0.62208711],
       [0.86848938, 0.58959038, 0.47565799]])

Create the array p containing the multinomial probabilities according to the formula p = [p_s, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4, (1-p_s)/4]:

In [75]: p = np.expand_dims(p_s, -1) * np.array([1, -0.25, -0.25, -0.25, -0.25]) + np.array([0, 0.25, 0.25, 0.25, 0.25])                                

In [76]: p                                                                                                                                              
Out[76]: 
array([[[0.61662208, 0.09584448, 0.09584448, 0.09584448, 0.09584448],
        [0.6072323 , 0.09819192, 0.09819192, 0.09819192, 0.09819192],
        [0.62208711, 0.09447822, 0.09447822, 0.09447822, 0.09447822]],

       [[0.86848938, 0.03287765, 0.03287765, 0.03287765, 0.03287765],
        [0.58959038, 0.1026024 , 0.1026024 , 0.1026024 , 0.1026024 ],
        [0.47565799, 0.1310855 , 0.1310855 , 0.1310855 , 0.1310855 ]]])

Now do the same as before to generate the sample (change the value 1000 to whatever is appropriate to your problem):

In [77]: sample = multinomial_rvs(1000, p)                                                                                                              

In [78]: sample                                                                                                                                         
Out[78]: 
array([[[618,  92, 112,  88,  90],
        [597, 104, 103, 101,  95],
        [605, 100,  95,  98, 102]],

       [[863,  32,  43,  27,  35],
        [602, 107, 108,  94,  89],
        [489, 130, 129, 129, 123]]])

In [79]: sample.sum(axis=-1)                                                                                                                            
Out[79]: 
array([[1000, 1000, 1000],
       [1000, 1000, 1000]])
Warren Weckesser
  • I have found that I needed to implement different counts at different sites so I have posted a modified version of this result in the main answer. I tried to post it here but it was too long for a comment. – algol May 03 '19 at 20:47

Possible solution

In principle you could do it with numba, as the multinomial distribution is supported there.

Numba allows you to simply decorate your numpy-based (and, even more importantly, plain Python) functions with the numba.njit decorator to get a significant performance boost.

Check their documentation to see this approach in more detail, especially section 2.7.4, which covers support for np.random (the multinomial distribution is supported as well).

Downside: the size argument is currently not supported. You can call np.random.multinomial multiple times in a nested loop instead, and it should still be faster when decorated with numba.njit.
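A minimal sketch of that nested-loop approach (the import fallback is my addition so the snippet also runs where numba is not installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:              # fall back to pure Python if numba is absent
    njit = lambda f: f

@njit
def multinomial_grid(n, p):
    # one multinomial draw per row of p; the explicit loop stands in for the
    # size argument of np.random.multinomial, which numba does not support
    out = np.empty(p.shape, dtype=np.int64)
    for i in range(p.shape[0]):
        out[i] = np.random.multinomial(n, p[i])
    return out

p = np.full((6, 4), 0.25)        # 6 sites, 4 categories each
sample = multinomial_grid(1000, p)
```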

Last but not least, you can parallelise the outer loop with numba.prange together with the parallel argument to the aforementioned decorator.

Performance Test

First test:

  • unparallelized numba with types signature
  • no numba at all

Code for testing:

import sys
from functools import wraps
from time import time

import numba
import numpy as np


def timing(function):
    @wraps(function)
    def wrap(*args, **kwargs):
        start = time()
        result = function(*args, **kwargs)
        end = time()
        print(f"Time elapsed: {end - start}", file=sys.stderr)
        return result

    return wrap


@timing
@numba.njit(numba.int64(numba.int64[:, :], numba.int64))
def my_multinomial(probabilities, output):
    experiments: int = 5000
    output_array = []
    for i in numba.prange(probabilities.shape[0]):
        probability = probabilities[i] / np.sum(probabilities[i])
        result = np.random.multinomial(experiments, pvals=probability)
        if i % output == 0:
            output_array.append(result)

    return output_array[-1][-1]


if __name__ == "__main__":
    np.random.seed(0)
    probabilities = np.random.randint(low=1, high=100, size=(10000, 1000))
    for _ in range(5):
        output = my_multinomial(probabilities, np.random.randint(low=3000, high=10000))

Results:

Unparallelized numba with types signature

Time elapsed: 1.0510437488555908
Time elapsed: 1.0691254138946533
Time elapsed: 1.065258264541626
Time elapsed: 1.0559568405151367
Time elapsed: 1.0446960926055908

No numba at all

Time elapsed: 0.9460861682891846
Time elapsed: 0.9581060409545898
Time elapsed: 0.9654934406280518
Time elapsed: 0.9708254337310791
Time elapsed: 0.9757359027862549

As one can see, numba is of no help in this case (it actually degrades performance). The results are consistent across varying sizes of the input array.

Second Test

  • parallelized numba without types signature
  • no numba at all

Code for test:

import sys
from functools import wraps
from time import time

import numba
import numpy as np


def timing(function):
    @wraps(function)
    def wrap(*args, **kwargs):
        start = time()
        result = function(*args, **kwargs)
        end = time()
        print(f"Time elapsed: {end - start}", file=sys.stderr)
        return result

    return wrap


@timing
@numba.njit(parallel=True)
def my_multinomial(probabilities, output):
    experiments: int = 5000
    for i in range(probabilities.shape[0]):
        probability = probabilities[i] / np.sum(probabilities[i])
        result = np.random.multinomial(experiments, pvals=probability)
        if i % output == 0:
            print(result)


if __name__ == "__main__":
    np.random.seed(0)
    probabilities = np.random.randint(low=1, high=100, size=(10000, 1000))
    for _ in range(5):
        my_multinomial(probabilities, np.random.randint(low=3000, high=10000))

Results:

Parallelized numba without types signature:

Time elapsed: 1.0705969333648682                                                                                                                                                          
Time elapsed: 0.18749785423278809                                                                                                                                                         
Time elapsed: 0.1877145767211914                                                                                                                                                          
Time elapsed: 0.18813610076904297                                                                                                                                                         
Time elapsed: 0.18747472763061523 

No numba at all

Time elapsed: 1.0142333507537842                                                                                                                                                          
Time elapsed: 1.0311956405639648                                                                                                                                                          
Time elapsed: 1.022024154663086                                                                                                                                                           
Time elapsed: 1.0191617012023926                                                                                                                                                          
Time elapsed: 1.0144879817962646

Partial conclusions

As max9111 correctly pointed out in the comments, I jumped to conclusions too early. It seems that parallelization (where possible) would be the biggest help in your case, while numba (at least in this still simple and not too comprehensive test) does not bring a big improvement.

All in all, you should check your exact case; as a rule of thumb, the more plain Python code you use, the better the results you can expect from numba. If your code is mainly numpy-based, you are not going to see much benefit, if any.

Szymon Maszke
  • As I mentioned, since I potentially need to change the input probabilities for each site, having the `size` argument wouldn't actually help here, although it will affect some other areas of my code. I was already considering whether Numba or Theano would be appropriate for what I need so this is good to know. – algol Apr 23 '19 at 21:23
  • Added an example performance test (hopefully similar to your problem). Oh, and it's a much better choice code-wise than Theano (which is no longer maintained and would require rewriting your logic). You may check `pytorch`, which is really similar to Python, though `numba` seems well-suited for this case IMO. – Szymon Maszke Apr 23 '19 at 22:12
  • You are measuring compilation (approx. 0.6s) and runtime (lower than 10µs). Since there is no output, Numba optimizes away the part of the code you want to benchmark (therefore the 10µs timing). 1) Never benchmark a function which makes no sense (e.g. no output) in any compiled code; often the compiler is smart enough (with some optimization flags enabled) to optimize away the useless parts. 2) Don't measure the first call if you want to get the runtime. – max9111 Apr 24 '19 at 08:09