Select an item from a stream at random with uniform probability, using constant space

The stream provides the following operations:

class Stream:

  def __init__(self, data):
    self.data = list(data)

  def read(self):
    if not self.data:
      return None

    head, *self.data = self.data
    return head

  def peek(self):
    return self.data[0] if self.data else None

The elements in the stream (i.e. the elements of data) are of constant size, and none of them is None, so None signals end of stream. The length of the stream can only be learned by consuming it entirely. Note that counting the number of elements consumes O(log n) space.
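To make the interface concrete, here is how the stream behaves (the class is repeated so the snippet runs standalone):

```python
class Stream:

  def __init__(self, data):
    self.data = list(data)

  def read(self):
    if not self.data:
      return None

    head, *self.data = self.data
    return head

  def peek(self):
    return self.data[0] if self.data else None

s = Stream([10, 20, 30])
assert s.peek() == 10    # peek does not consume the element
assert s.read() == 10
assert s.read() == 20
assert s.read() == 30
assert s.read() is None  # None signals end of stream
```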

I believe there is no way to uniformly choose an item from the stream at random using O(1) space.

Can anyone (dis)prove this?

  • Your definition of "O(1) space" precludes a PRNG with enough state space to select an element uniformly, so it's trivially impossible... But that's not really a practical definition of O(1) space. – Matt Timmermans Apr 11 '19 at 23:06
  • @MattTimmermans And what if I use a true RNG device which returns a random value bit by bit? – Severin Pappadeux Apr 12 '19 at 00:05
  • @SeverinPappadeux sure, or an analog computer, or a lot of other things, but there's no indication that any of that is allowed. – Matt Timmermans Apr 12 '19 at 00:11
  • Why does counting the number of elements consume O(log n) space? – Mooing Duck Apr 12 '19 at 00:20
  • Huh?!? Where did you find such a restriction? Writing such a generator on top of /dev/random is quite trivial. – Severin Pappadeux Apr 12 '19 at 00:21
  • @MooingDuck for a stream of n items, you need enough storage to represent n. For example if **n=16**, you need **4 bits**, because each bit can take two distinct values (0/1), so together they give 2^4 = 16 combinations. You can see that **4 = log2 16 = log2 n**. Similarly for other n. – Борат Сагдиев Apr 12 '19 at 08:42
  • @БоратСагдиев: In big-O analysis, it's usually assumed that one storage unit can hold any number, regardless of bits. It only takes O(1) to store the count. – Mooing Duck Apr 12 '19 at 16:28
  • Yes, this is the so-called uniform cost model. I omitted this in the question, but I am assuming a logarithmic cost model. See https://en.wikipedia.org/wiki/Analysis_of_algorithms#Cost_models – Борат Сагдиев Apr 12 '19 at 17:09

2 Answers


In constant space? Sure: reservoir sampling. Constant space, linear time.

Some lightly tested code:

import numpy as np

def stream(size):
    for k in range(size):
        yield k

def resSample(ni, s):
    # reservoir holding ni elements
    ret = np.empty(ni, dtype=np.int64)

    for k, v in enumerate(s):
        if k < ni:
            ret[k] = v  # fill the reservoir with the first ni items
        else:
            # keep item k with probability ni/(k+1)
            idx = np.random.randint(0, k + 1)
            if idx < ni:
                ret[idx] = v

    return ret

SIZE = 12

s = stream(SIZE)
q = resSample(1, s)
print(q)
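As a sanity check (my addition, not part of the original answer): instead of sampling, one can propagate the exact selection probabilities. With a reservoir of one element, item k (0-based) replaces the current pick with probability 1/(k+1), and every item ends up selected with probability exactly 1/n:

```python
from fractions import Fraction

def reservoir_probs(n):
    # probs[i] = exact probability that item i is the current pick
    probs = []
    for k in range(n):
        p_replace = Fraction(1, k + 1)                 # item k replaces the pick w.p. 1/(k+1)
        probs = [p * (1 - p_replace) for p in probs]   # earlier items survive w.p. k/(k+1)
        probs.append(p_replace)
    return probs

assert reservoir_probs(5) == [Fraction(1, 5)] * 5  # uniform, as claimed
```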

I see there is a question about the RNG. Suppose I have a true RNG: a hardware device which returns a single bit at a time. We use it only in the line where we get the index:

if idx < ni:

When selecting a single element (ni=1), the only way the condition can trigger is when idx is zero.

Thus np.random.randint(0, k+1) with such an implementation would be something like:

def trng(k):
    for _ in range(k + 1):
        if next_true_bit():
            return 1  # as soon as any bit is nonzero, the index cannot be 0
    return 0  # all bits were zero: index is zero, proceed with the exchange

QED: such a realization is possible, and therefore this sampling method should work.

UPDATE

@kyrill is probably right: I have to keep a count going (log2(k) storage), and so far I see no way to avoid it. Even with the RNG trick, I have to sample 0 with probability 1/k, and this k grows with the size of the stream.
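A side note on the bit-by-bit trick itself (my addition): with fair bits, trng(k) returns 0 exactly when all k+1 bits are zero, i.e. with probability 2^-(k+1) rather than the required 1/(k+1). Enumerating every possible bit string makes this explicit (next_true_bit is replaced by a fixed bit sequence so the check is deterministic):

```python
from itertools import product

def trng_fixed(k, bits):
    # the trng from above, fed a predetermined bit sequence instead of hardware
    bits = iter(bits)
    for _ in range(k + 1):
        if next(bits):
            return 1
    return 0

k = 3
zeros = sum(trng_fixed(k, bs) == 0 for bs in product([0, 1], repeat=k + 1))
assert zeros == 1          # only the all-zero string yields index 0
assert 2 ** (k + 1) == 16  # so P(idx == 0) = 1/16, not 1/(k+1) = 1/4
```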

Severin Pappadeux

Generate a random number for each element, and remember the element with the smallest number.
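A minimal sketch of that idea (my illustration, not code from the answer; note that under the question's logarithmic cost model the tags themselves need enough bits to avoid ties, so this is not O(1) space either):

```python
import random

def pick_uniform(items):
    # keep the element that draws the smallest random tag
    best_tag, best_item = float("inf"), None
    for item in items:
        tag = random.random()  # assumes ties never happen (probability ~0)
        if tag < best_tag:
            best_tag, best_item = tag, item
    return best_item

chosen = pick_uniform(range(100))
assert 0 <= chosen < 100  # each element is chosen with probability 1/100
```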

That's the answer I like best, but the answer you're probably looking for is:

If the stream is N items long, then the probability of returning the Nth item is 1/N. Since this probability is different for every N, any machine that can accomplish this task must enter different states after reading streams of different lengths. Since the number of possible lengths is unbounded, the required number of possible states is unbounded, and the machine will require an unbounded amount of memory to distinguish between them.

Matt Timmermans
  • To achieve a truly uniform distribution, there would have to be at least as many possible random values as there are items in the stream (or a multiple of that). Then the biggest random number will be at least **n**, where n is the length of the stream, so it would require _O(log n)_ space. The fact that you don't _keep_ the biggest number doesn't matter, since at some point you still have to generate it and hold it in memory / a register. – Борат Сагдиев Apr 11 '19 at 23:25
  • I see in your other comment you suggested that this is not possible. In fact that answers my question, so if you post your comment as an answer and maybe elaborate a little, I will accept it. – Борат Сагдиев Apr 11 '19 at 23:38
  • I'd rather provide a simple and effective way to select uniformly from a stream in case some future searcher has a practical need to do so. – Matt Timmermans Apr 12 '19 at 00:00
  • @БоратСагдиев No, suppose you have true RNG. – Severin Pappadeux Apr 12 '19 at 00:10
  • @БоратСагдиев, Severin has goaded me into providing the answer you want :) Related to the Myhill-Nerode theorem. – Matt Timmermans Apr 12 '19 at 00:42