59

I am looking for a simple function that can generate an array of specified random values based on their corresponding (also specified) probabilities. I only need it to generate float values, but I don't see why it shouldn't be able to generate any scalar. I can think of many ways of building this from existing functions, but I think I probably just missed an obvious SciPy or NumPy function.

E.g.:

>>> values = [1.1, 2.2, 3.3]
>>> probabilities = [0.2, 0.5, 0.3]
>>> print(some_function(values, probabilities, size=10))
(2.2, 1.1, 3.3, 3.3, 2.2, 2.2, 1.1, 2.2, 3.3, 2.2)

Note: I found scipy.stats.rv_discrete but I don't understand how it works. Specifically, I do not understand what this (below) means nor what it should do:

numargs = generic.numargs
[ <shape(s)> ] = ['Replace with resonable value', ]*numargs

If rv_discrete is what I should be using, could you please provide me with a simple example and an explanation of the above "shape" statement?

TimY

5 Answers

88

Drawing from a discrete distribution is directly built into numpy. The function is called random.choice (difficult to find without any reference to discrete distributions in the numpy docs).

import numpy as np

elements = [1.1, 2.2, 3.3]
probabilities = [0.2, 0.5, 0.3]
np.random.choice(elements, 10, p=probabilities)
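On NumPy 1.17 and later, the same weighted sampling is also available through the recommended Generator API, which accepts the same p= keyword; a minimal sketch:

```python
import numpy as np

# Generator is the recommended random API in modern NumPy (>= 1.17);
# seeding makes the draws reproducible.
rng = np.random.default_rng(seed=0)

elements = [1.1, 2.2, 3.3]
probabilities = [0.2, 0.5, 0.3]

samples = rng.choice(elements, size=10, p=probabilities)
print(samples)
```

The probabilities must sum to 1, otherwise `choice` raises a ValueError.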
Hari
goebbe
    Great! But, the correct syntax is: np.random.choice(elements, 10, p=list(probabilities)) – Sina Feb 01 '16 at 03:13
  • Nice. I think this version came out after I posted my original question (I think this was first released in 1.7.0, which I believe came in 2013). – TimY Jul 22 '16 at 09:01
  • Very nice! Seems to work also without casting to list: np.random.choice(elements, 10, p=probabilities). – zeycus Jul 26 '16 at 17:37
  • In addition to comments by [Sina](http://stackoverflow.com/questions/11373192/generating-discrete-random-variables-with-specified-weights-using-scipy-or-numpy/32179721#comment57965531_32179721) and [zeycus](http://stackoverflow.com/questions/11373192/generating-discrete-random-variables-with-specified-weights-using-scipy-or-numpy/32179721#comment64580901_32179721), `elements` and `probabilites` could have been ordinary `list`s instead of `numpy.array`s and the code would work the same. – arekolek Jul 31 '16 at 08:05
26

Here is a short, relatively simple function that returns weighted values; it uses NumPy's digitize, add.accumulate, and random_sample.

import numpy as np
from numpy.random import random_sample

def weighted_values(values, probabilities, size):
    bins = np.add.accumulate(probabilities)
    return values[np.digitize(random_sample(size), bins)]

values = np.array([1.1, 2.2, 3.3])
probabilities = np.array([0.2, 0.5, 0.3])

print(weighted_values(values, probabilities, 10))
#Sample output:
[ 2.2  2.2  1.1  2.2  2.2  3.3  3.3  2.2  3.3  3.3]

It works like this:

  1. First, using accumulate, we build the cumulative bins.
  2. Then we generate a batch of random numbers between 0 and 1 using random_sample.
  3. We use digitize to see which bins these numbers fall into.
  4. And return the corresponding values.
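The intermediate arrays for the example weights make these steps concrete (the draws below are fixed by hand for illustration, rather than taken from random_sample):

```python
import numpy as np

values = np.array([1.1, 2.2, 3.3])
probabilities = np.array([0.2, 0.5, 0.3])

# Step 1: cumulative bin edges, approximately [0.2, 0.7, 1.0]
bins = np.add.accumulate(probabilities)

# Steps 2-4: a draw of 0.65 lies between 0.2 and 0.7, so it falls in
# bin index 1 and maps to values[1] == 2.2
draws = np.array([0.05, 0.65, 0.95])
indices = np.digitize(draws, bins)
print(indices)          # [0 1 2]
print(values[indices])  # [1.1 2.2 3.3]
```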
Eric O. Lebigot
fraxel
  • Yes, this is basically what I was thinking of, but I thought there might be a built-in function that does exactly that. From the sound of it, there is no such thing. I must admit I would not have done it as elegantly. Thanks! – TimY Jul 07 '12 at 15:40
  • NumPy directly offers `numpy.cumsum()`, which can be used instead of `np.add.accumulate()` (`np.add()` is not very commonly used, so I recommend using `cumsum()`). – Eric O. Lebigot Feb 02 '13 at 08:11
  • +1 for the useful `numpy.digitize()`! However, SciPy actually offers a function that directly answers the question—see my answer. – Eric O. Lebigot Feb 02 '13 at 08:29
  • PS:… As noted by Tim_Y, using SciPy's function is much slower than using your "manual" solution (on 10k elements). – Eric O. Lebigot Feb 02 '13 at 13:09
  • Do the probabilities need to be normalized for this? – IssamLaradji Jan 24 '15 at 20:55
  • @Curious: Yes, the probabilities do have to be normalized, because `random_sample()` returns numbers in [0; 1), so the bins cannot extend beyond this range (they would if the probabilities would sum to more than 1). – Eric O. Lebigot May 29 '16 at 20:53
17

You were going in a good direction: the built-in scipy.stats.rv_discrete() quite directly creates a discrete random variable. Here is how it works:

>>> import numpy
>>> from scipy.stats import rv_discrete

>>> values = numpy.array([1.1, 2.2, 3.3])
>>> probabilities = [0.2, 0.5, 0.3]

>>> distrib = rv_discrete(values=(range(len(values)), probabilities))  # This defines a Scipy probability distribution

>>> distrib.rvs(size=10)  # 10 samples from range(len(values))
array([1, 2, 0, 2, 2, 0, 2, 1, 0, 2])

>>> values[_]  # Conversion to specific discrete values (the fact that values is a NumPy array is used for the indexing)
array([2.2, 3.3, 1.1, 3.3, 3.3, 1.1, 3.3, 2.2, 1.1, 3.3])

The distribution distrib above thus returns indexes from the values list.

More generally, rv_discrete() accepts any sequence of integers as the first element of its values=(…, …) argument and returns those values directly, so when the desired outcomes are themselves integers there is no need for a conversion step. Here is an example:

>>> values = [10, 20, 30]
>>> probabilities = [0.2, 0.5, 0.3]
>>> distrib = rv_discrete(values=(values, probabilities))
>>> distrib.rvs(size=10)
array([20, 20, 20, 20, 20, 20, 20, 30, 20, 20])

where (integer) input values are directly returned with the desired probability.
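For float outcomes, the two steps above (draw indices, then index into the values array) can be wrapped in one small helper; the function name sample_floats is mine, not part of SciPy:

```python
import numpy as np
from scipy.stats import rv_discrete

def sample_floats(values, probabilities, size):
    # rv_discrete needs integer support points, so draw indices
    # and map them back onto the (possibly float) values.
    values = np.asarray(values)
    distrib = rv_discrete(values=(range(len(values)), probabilities))
    return values[distrib.rvs(size=size)]

print(sample_floats([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))
```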

Eric O. Lebigot
  • NOTE: I tried running timeit on it, and it appears to be a good 100x slower than fraxel's purely numpy version. Do you by any chance know why that is? – TimY Feb 02 '13 at 12:55
  • Wow, interesting! On 10k elements, I even get a factor of 300x slower. I had a quick look at the code: there are many checks performed, but I guess that they cannot explain such a big difference in running time; I did not go deep enough into the Scipy code to have been able to see where the difference could come from… – Eric O. Lebigot Feb 02 '13 at 13:23
  • @TimY my naive guess is that the slowness is due to more work being done in pure Python, less work being done (under the hood) in C. (the mathematical/scientific packages in Python tend to wrap C code.) – abcd Feb 17 '16 at 15:20
  • suppose i were to start with an equation for my probability distribution. it seems silly to have to use that to generate a probability for each value, feed that to `rv_discrete`, and then get back from `rv_discrete` an approximation of the distribution i started with. is there any way to use user-defined equations directly with `scipy`? – abcd Feb 18 '16 at 17:56
  • @dbliss: I guess that your equation models a _continuous_ random variable, instead of a discrete one (which is the subject of this question), so going through a discrete variable may indeed not be best (unless the approximation of a discrete distribution helps you get faster running code). You might want to look at how SciPy handles continuous variables, for example starting from http://scicomp.stackexchange.com/a/1659/9996. – Eric O. Lebigot Feb 23 '16 at 17:31
  • @EOL nope, i'm using a discrete random variable. not sure why you think i'm not. it turns out i'm using a Poisson random variable, and there's a function in `numpy` for drawing samples from a Poisson distribution (`np.random.poisson`). the same i'm sure is true of most standard distributions. my question remains unanswered, though, for more idiosyncratic distributions. – abcd Feb 23 '16 at 20:01
  • @dbliss Now I see that you had in mind the case of a discrete distribution with an _infinite_ number of possible values (which does not fit into this question). `rv_discrete()` does not have an option for this. I am not sure what the standard method for doing this is. (I can only think of slightly complicated variations of the usual method that transforms a uniform random variable into a variable with a non-uniform distribution, where the cumulative probability is only calculated for the most common values and extended beyond that when needed.) – Eric O. Lebigot Feb 24 '16 at 22:13
6

The simplest DIY approach is to sum the probabilities into a cumulative distribution. This splits the unit interval into sub-intervals whose lengths equal your original probabilities. Now generate a single random number uniformly on [0, 1), and see which interval it lands in.
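A minimal sketch of that idea using only the standard library (the function name weighted_draw is mine):

```python
import random
from bisect import bisect
from itertools import accumulate

def weighted_draw(values, probabilities, size):
    # Cumulative distribution: [0.2, 0.7, 1.0] for the example weights.
    cdf = list(accumulate(probabilities))
    # Drop the last edge so float round-off in the final sum can never
    # index past the end of values.
    edges = cdf[:-1]
    # One uniform draw per sample; bisect finds the interval it lands in.
    return [values[bisect(edges, random.random())] for _ in range(size)]

print(weighted_draw([1.1, 2.2, 3.3], [0.2, 0.5, 0.3], 10))
```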

ev-br
4

You could also use Lea, a pure Python package dedicated to discrete probability distributions.

>>> from lea import Lea
>>> distrib = Lea.fromValFreqs((1.1,2),(2.2,5),(3.3,3))
>>> distrib
1.1 : 2/10
2.2 : 5/10
3.3 : 3/10
>>> distrib.random(10)
(2.2, 2.2, 1.1, 2.2, 2.2, 2.2, 1.1, 3.3, 1.1, 3.3)

Et voilà!

Pierre Denis