
I need to create a large numpy array containing random boolean values without hitting the swap.

My laptop has 8 GB of RAM. Creating a (1200, 2e6) array of ones takes less than 2 s and uses 2.29 GB of RAM:

>>> import numpy as np
>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375

>>> dd.shape
(1200, 2000000)

For a relatively small (400e3, 1200) array, np.random.randint is still quite fast, taking roughly 5 s to generate the 458 MB result:

db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'MB'

But if I double the size to (800e3, 1200), I hit the swap and it takes ~2.7 min to create db ;(

cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print db.nbytes/1024./1024., 'Mb'"""

import timeit
print timeit.Timer(cmd).timeit(1)

Using random.getrandbits takes even longer (~8 min), and also hits the swap:

from random import getrandbits
db = np.array([not getrandbits(1) for x in xrange(int(1200*800e3))], dtype=bool)

Using np.random.randint for the full (1200, 2e6) array just gives a MemoryError.
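That MemoryError is consistent with a quick back-of-the-envelope estimate: with the default 64-bit integer dtype, the intermediate array alone would need far more than my 8 GB of RAM:

>>> 1200 * 2e6 * 8 / 1024.**3   # elements x 8 bytes per int64, in GiB
17.881393432617188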

Is there a more efficient way to create a (1200, 2e6) random boolean array?


1 Answer


One problem with using np.random.randint is that it generates 64-bit integers by default, whereas numpy's np.bool dtype uses only 8 bits (one byte) to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.
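A quick check of the default dtype illustrates this (output shown for a typical 64-bit platform; the default integer dtype is platform-dependent):

>>> import numpy as np
>>> tmp = np.random.randint(2, size=4)
>>> tmp.dtype        # platform default integer
dtype('int64')
>>> tmp.itemsize     # 8 bytes per element, vs 1 byte per np.bool element
8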

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are less than 255 * p, where p is the desired probability of each element being True:

import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)   # one random byte per element
    return (x < 255 * p).reshape(shape)

Benchmark:

In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop
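A small illustrative check of the p argument (the shape and tolerance here are arbitrary):

>>> a = random_bool((1000, 1000), p=0.25)
>>> a.dtype
dtype('bool')
>>> abs(a.mean() - 0.25) < 0.01   # fraction of True values is close to p
True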

One important caveat is that the probability will be quantized to a multiple of 1/256, and not necessarily the nearest one (for an exact multiple of 1/256 such as p=1/2 this does not affect accuracy).
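The quantization is easy to inspect exactly, since a uint8 can only take 256 values; the following sketch enumerates them and computes the realized probability for a few values of p:

>>> x = np.arange(256)                               # every possible byte value
>>> [(x < 255 * p).mean() for p in (0.5, 0.1, 0.9)]  # exact realized probabilities
[0.5, 0.1015625, 0.8984375]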


Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, expand it to one element per bit using np.unpackbits, then reinterpret the resulting 0/1 values as np.bool:

def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)     # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)        # one random byte per 8 output elements
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)    # reinterpret 0/1 bytes as bool, no copy

For example:

In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
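As a sanity check (not part of the original answer), the result has the same 2.29 GB footprint as the np.ones array from the question, with no large intermediate, and the fraction of True values is very close to 1/2:

In [4]: r = fast_random_bool((1200, int(2E6)))

In [5]: r.nbytes / 1024. / 1024.
Out[5]: 2288.818359375

In [6]: abs(r.mean() - 0.5) < 1e-3
Out[6]: True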
  • Your last solution will be even faster if, rather than doing `.astype(np.bool)`, you instead go with `.view(np.bool)` or `.astype(np.bool, copy=False)`, as either one will spare you a copy of the full array. – Jaime Dec 28 '15 at 08:14
  • @Jaime Thanks - I always forget that `.astype()` returns a copy by default – ali_m Dec 28 '15 at 13:30
  • thanks @ali_m this random boolean array is in the context of a numpy-broadcasting question: http://stackoverflow.com/q/34496409/3313834 – user3313834 Dec 29 '15 at 12:09
  • The "generate bytes and compare to `255*p`" strategy has the limitation that the probabilities are rounded to multiples of 1/256, and not always the "right" multiple of 1/256. – user2357112 Apr 13 '16 at 21:50
  • @user2357112 You're right - I've edited my answer to mention this caveat, although I'm not sure it's possible to do any better without allocating a larger intermediate array. – ali_m Apr 14 '16 at 12:05