23

What is the Big-O complexity of `random.choice(list)` in Python 3, where n is the number of elements in the list?

Edit: Thank you all for giving me the answer; now I understand.

General Grievance
  • 4,555
  • 31
  • 31
  • 45
mil
  • 231
  • 1
  • 2
  • 4
  • If it's not stated in the specification, it's presumably implementation-dependent. – Barmar Oct 19 '16 at 23:40
  • 2
    I can't imagine any reason why it wouldn't be `O(1)`. It just needs to pick a random number `i` from `0` to `len(list)`, then return `list[i]`. They're all constant-time operations. – Barmar Oct 19 '16 at 23:43
  • If Python lists were implemented as linked lists, it would be `O(n)`, since getting the length of a list is linear, as is accessing a selected element. But since they're actually arrays, everything is constant. – Barmar Oct 19 '16 at 23:44

4 Answers

17

O(1). Or to be more precise, it's equivalent to the big-O random access time for looking up a single index in whatever sequence you pass it, and `list` has O(1) random access indexing (as does `tuple`). Simplified, all it does is `seq[random.randrange(len(seq))]`, which is obviously equivalent to a single index lookup operation.
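For illustration, a minimal sketch of that equivalence (the sequence contents here are just examples):

```
import random

seq = ["a", "b", "c", "d", "e"]

# Both forms reduce to one random number plus one O(1) list index lookup.
via_choice = random.choice(seq)
via_index = seq[random.randrange(len(seq))]
print(via_choice, via_index)
```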

An example where it would be O(n) is `collections.deque`, where indexing in the middle of the deque is O(n) (with a largish constant divisor though, so it's not that expensive unless the deque reaches into the thousands of elements or more). So basically, don't use a `deque` if it's going to be large and you plan to select random elements from it repeatedly; stick to `list`, `tuple`, `str`, `bytes`/`bytearray`, `array.array` and other sequence types with O(1) indexing.
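A rough way to see the difference is to time `random.choice` against both container types; this is only an illustrative sketch, and the absolute numbers depend on the machine:

```
import random
import timeit
from collections import deque

n = 1_000_000
lst = list(range(n))
dq = deque(lst)

# Indexing into the middle of a large deque is O(n), so random.choice on it
# is noticeably slower than on a list, where indexing is O(1).
print("list :", timeit.timeit(lambda: random.choice(lst), number=10_000))
print("deque:", timeit.timeit(lambda: random.choice(dq), number=10_000))
```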

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
  • What do you mean with "largish" constant divisor? – Stefan Pochmann Oct 20 '16 at 00:33
  • 1
    @StefanPochmann: Implementation details here, but CPython's `deque` is a linked list of blocks that each store [up to 64 values](https://hg.python.org/cpython/file/c445746d0846/Modules/_collectionsmodule.c#l21) (the left and right blocks may be incomplete). Until the length crests 66, no lookup can require any block traversal, and even up to a thousand elements, you're still looking at a maximum of 17 blocks (at most half of which must be traversed for indexing), and the Python interpreter overhead generally swamps small C level costs like traversing 8 hops in a linked list. – ShadowRanger Oct 20 '16 at 00:39
  • Oh cool, I had no idea it does that. But makes sense. Thanks, nice to know. – Stefan Pochmann Oct 20 '16 at 00:47
  • Does O(1) still hold true if I'm using a weighted random choice? https://docs.python.org/3/library/random.html#random.choices Similar to setting [p in `np.random.choice()`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html) rather than using a uniform distribution? – stefanbschneider Mar 05 '20 at 13:34
  • 1
    @CGFoX: Weighted choice for `choices` is `O(n log m)`, where `n` is the number of choices being made, and `m` is the total number of weights (it bisects the cumulative weights for each choice, which is `O(log m)` work, and does that `n` times). So it's still `O(1)` per selection in terms of the sequence being selected from (excluding cases like `deque`), it's just a tiny bit more expensive to weight the choice. – ShadowRanger Mar 05 '20 at 13:56
  • Thanks for the prompt reply! Without using `random.choice`, would it still be possible to achieve O(1) for a weighted choice if I do some pre-processing? I could randomly select some `i` in the range of all summed up weights in O(1). And then use a hash table to map the selected `i` to one of the `n` choices in O(1). Would require pre-processing and extra memory but allow weighted random choice in O(1), right? – stefanbschneider Mar 05 '20 at 14:10
  • 1
    @CGFoX: Providing cumulative weights instead of relative weights saves preprocessing from relative to cumulative. Improving on that would require you to make a side-band `list` that described each index `x` number of times (e.g. if index `0` was 10% likely, and `1` was 90% likely, you'd make a list `[0,1,1,1,1,1,1,1,1,1]`, do a `O(1)` selection from that and then do a `O(1)` lookup of the original source `list`). But that side-band `list` has a minimum size equal to the original `list`, and for weird percentage weights, it could occupy all your memory. `O(log m)` is cheap enough. – ShadowRanger Mar 05 '20 at 14:16
  • Just to clarify: If I limit myself to integer percentages as cumulative weights, I'd have a side-band list of size 100 (from 0% to 100%) and could achieve O(1) look up? – stefanbschneider Mar 05 '20 at 14:38
  • @CGFoX: Not in the design I mentioned, where each index in the side-band list corresponds to an index in the main list (which is the only way to individually weight indices in the main list while preserving `O(1)` work). Cumulative weights must be searched; since it's a binary search strategy, that's `O(log m)` in the number of weights. If you limit yourself to integer cumulative weights, that's pretty cheap anyway (log₂100 is less than 7, which is not likely to kill you, perf-wise); I'm not sure why you're so insistent on `O(1)`, when it's going to cost so much on preprocessing and memory. – ShadowRanger Mar 05 '20 at 17:29
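For illustration, a minimal sketch of the two weighting approaches discussed in the comments above (bisecting cumulative weights, as `random.choices` does, versus a pre-expanded side-band list); the `weights` values are hypothetical small integers:

```
import bisect
import itertools
import random

# Hypothetical weights for 4 items (indices 0..3).
weights = [10, 60, 20, 10]

# Approach 1: bisect the cumulative weights -- O(log m) per selection,
# where m is the number of weights.
cumulative = list(itertools.accumulate(weights))   # [10, 70, 90, 100]
r = random.random() * cumulative[-1]
index = bisect.bisect(cumulative, r)

# Approach 2: the "side-band list" idea -- repeat each index proportionally
# to its weight, then pick uniformly in O(1). Costs O(total weight) memory.
side_band = [i for i, w in enumerate(weights) for _ in range(w)]
index2 = random.choice(side_band)

print(index, index2)
```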
7

Though the question is about random.choice and the previous answers give several explanations for it, when I searched for the complexity of np.random.choice I didn't find an answer, so I decided to explain np.random.choice here.

`choice(a, size=None, replace=True, p=None)`. Assume `a.shape = (n,)` and `size = m`.

With replacement:

The complexity of np.random.choice is O(m) if p is not specified (i.e., a uniform distribution), and O(n + m log n) if p is specified.

The GitHub code can be found here: np.random.choice.

When p is not specified, choice generates an index array with randint and returns a[index], so the complexity is O(m). (I assume that generating one random integer with randint is O(1).)
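A rough sketch of that uniform case (an illustration of the idea, not the actual NumPy source); `a` and `m` here are hypothetical:

```
import numpy as np

a = np.arange(1000)   # n = 1000
m = 10

# With replacement and no p: conceptually, draw m random indices and index
# into a -- m draws plus m O(1) lookups, i.e. O(m) overall.
idx = np.random.randint(0, len(a), size=m)
sample = a[idx]

# Should match the distribution of:
sample2 = np.random.choice(a, size=m, replace=True)
```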

When p is specified, the function first computes the prefix sum of p. It then draws m samples from [0, 1) and uses binary search to find, for each drawn sample, the corresponding interval in the prefix sum. The evidence that it uses binary search can be found here. So this process is O(n + m log n). If you need a faster method in this situation, you can use the Alias Method, which needs O(n) time for preprocessing and O(m) time to sample m items.
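For illustration, a minimal sketch of the Alias Method mentioned above (Vose's variant), with O(n) preprocessing and O(1) per sample; the `weights` distribution is hypothetical:

```
import random

def build_alias_table(p):
    """Vose's alias method: O(n) preprocessing for probabilities p summing to 1."""
    n = len(p)
    prob = [0.0] * n
    alias = [0] * n
    scaled = [x * n for x in p]
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        if scaled[l] < 1.0:
            small.append(l)
        else:
            large.append(l)
    for i in large + small:          # leftovers are (numerically) exactly 1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    """O(1) per sample: pick a column uniformly, then flip a biased coin."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

weights = [0.1, 0.2, 0.3, 0.4]       # hypothetical distribution
prob, alias = build_alias_table(weights)
samples = [alias_draw(prob, alias) for _ in range(5)]
print(samples)
```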


Without replacement (it's kind of complicated, and maybe I'll finish it in the future):

If p is not specified, the complexity is the same as np.random.permutation(n), even when m is only 1. See more here.
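Loosely, that uniform without-replacement case behaves like taking the first m entries of a full permutation (an illustrative sketch, not the NumPy source):

```
import numpy as np

a = np.arange(1000)
m = 10

# Without replacement and no p: the cost is tied to permuting all n elements,
# even if only m of them are kept.
sample = np.random.permutation(a)[:m]
```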

If p is specified, the expected complexity is $O\left(n \log n \log\frac{n}{n + 1 - m}\right)$. (This is an upper bound, and not a tight one.)

Muzhi
  • 109
  • 1
  • 5
  • Thanks for the answer on np.random.choice. Can you please explain in more detail why "When p is not specified, choice generates index by randint and returns a[index], so the complexity is O(m)"? Doesn't Python have O(1) time complexity when it performs a lookup on lists? Here's a link: https://wiki.python.org/moin/TimeComplexity Thanks – Antonio Ercole De Luca May 12 '20 at 11:11
  • @eracle: Thanks for your comment. I've updated the answer and made it clear that we use `np.random.choice` to sample `m` items from an array of length `n`. In the situation you mention, we need O(m) time to generate an index array of size `m` and perform `m` lookups. However, I've assumed that using `np.random.randint` to sample a random integer needs O(1) time; I'm not really sure about this assumption. – Muzhi May 14 '20 at 09:13
  • ```The complexity for np.random.choice is O(m) if p is not specified (assuming it as uniform distribution), and is O(n + n log m ) if p is specified.``` . ```O(n + n log m )``` should be ```O(n + m log n )``` – DachuanZhao Oct 27 '20 at 10:33
5

The complexity of random.choice(list) is O(log n) where n is the number of elements in the list.

The cpython implementation uses _randbelow(len(seq)) to get a pseudo-random index and then returns the item at that index.

The bottleneck is the _randbelow() function, which uses rejection sampling to generate a number in the range [0, n). The function generates k pseudo-random bits with a call to getrandbits(k), where k = ceil(log2 n). These bits represent a number in the range [0, 2**k). This process is repeated until the generated number is less than n. Each call to the pseudo-random number generator runs in O(k), where k is the number of bits generated, which is O(log n).
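For illustration, a simplified sketch of that rejection-sampling idea (not CPython's actual `_randbelow` code):

```
import random

def randbelow_sketch(n):
    """Return a uniform int in [0, n) via rejection sampling."""
    k = n.bit_length()             # roughly ceil(log2(n)) bits are needed
    r = random.getrandbits(k)      # uniform in [0, 2**k)
    while r >= n:                  # reject values outside [0, n)
        r = random.getrandbits(k)
    return r

items = ["a", "b", "c", "d", "e"]
print(items[randbelow_sketch(len(items))])
```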

chlamb16
  • 91
  • 1
  • 2
  • 4
-1

I think the above answer is incorrect. I empirically verified that the complexity of this operation is O(n). Here is my code and a little plot. I am not sure about the theory though.

from time import time
import numpy as np
import matplotlib.pyplot as plt

N = np.logspace(2, 10, 40)
output = []
for i, n in enumerate(N):
    print(i)
    n = int(n)
    start = time()
    A = np.random.choice(list(range(n)), n // 2)
    output.append(time() - start)
plt.plot(N, output)
plt.show()

This is the plot I got, which looks quite linear to me. [Plot: complexity of np.random.choice]

mhsnk
  • 97
  • 1
  • 6
  • 8
    The runtime of the test you performed is dominated by the `list(range(n))` call, which creates a list of length n; that operation is O(n). You want to generate the list before starting your timer so that it only includes the `np.random.choice` function. – chlamb16 Jun 09 '20 at 04:51
  • I'd also suggest using the `timeit` module, and using a `loglog` plot, otherwise you're not going to see the small `N`s – Sam Mason Apr 27 '21 at 12:44
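For illustration (not part of the original thread), a sketch of the benchmark with the fixes suggested in these comments: the input is built before timing, `timeit` is used, and the result is plotted on log-log axes. The sizes and repeat counts are arbitrary.

```
import timeit
import numpy as np
import matplotlib.pyplot as plt

sizes = np.logspace(2, 6, 20).astype(int)   # a smaller range than the original
times = []
for n in sizes:
    a = np.arange(n)        # built before timing; passing an array also avoids
                            # the per-call list-to-array conversion inside choice
    t = timeit.timeit(lambda: np.random.choice(a, n // 2), number=5)
    times.append(t / 5)

plt.loglog(sizes, times)    # log-log axes keep the small-n points visible
plt.xlabel("n")
plt.ylabel("seconds per call")
plt.show()
```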