
I want to iterate over integers in the range 0 to N-1, where N is a large number. This can easily be done with for i in range(N):.

However, I want to iterate over the numbers in a random order. This can also easily be done using something like:

from random import shuffle
a = list(range(N))
shuffle(a)
for i in a:
    do_something(i)

The problem with this approach is that it requires storing the entire list of numbers in memory (shuffle(range(N)) raises a TypeError, since a range object does not support item assignment). This makes it impractical for my purposes for large N.
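For the record, a quick check of both points (a sketch, with N = 10 standing in for a large N):

```python
import random

N = 10  # stand-in for a large N

# shuffling works on a list, but only after materializing all N numbers
a = list(range(N))
random.shuffle(a)
print(a)

# shuffling a range directly fails: shuffle needs item assignment
try:
    random.shuffle(range(N))
except TypeError as e:
    print("TypeError:", e)
```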

I would like to have an object which is an iterator (just like range(N)), which does not store all numbers in the memory (again, just like range(N)), and which iterates in a random order.

Now, when I say "random order", I really mean that the order is sampled from the uniform distribution over the set of all permutations of (0, 1, ..., N-1). I know that the number of such permutations is potentially very large (N!), and therefore an iterator that had to represent which permutation it uses would need a very large amount of memory.

Therefore, I can settle for "random order" meaning "looks like a uniform distribution although it is actually not", in some sense which I have not defined.

If I had such an iterator, this is how I would operate it:

a = random_order_range(N)  # this object takes much less memory than the factorial of N
for i in a:
    do_something(i)

Any ideas how this can be done?


EDIT1:

Actually, what I am really interested in is that the memory consumption will be even less than ~N, if possible... Maybe something like O(k*N) for some k that could be much smaller than 1.
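One classic low-memory idea in this spirit (a sketch, not anything from the original question): an affine walk i -> (offset + i*step) mod N with step coprime to N is a bijection on 0..N-1, so it visits every number exactly once using O(1) memory. The orders it can produce are only a tiny, rather structured subset of all N! permutations, so this only satisfies the weaker "looks random" requirement:

```python
import random
from math import gcd

def pseudo_random_range(N):
    # Affine walk: i -> (offset + i*step) % N with gcd(step, N) == 1
    # is a bijection on 0..N-1, so every number is visited exactly once
    # using O(1) memory. The order merely "looks" shuffled; it is far
    # from a uniformly random permutation.
    step = random.randrange(1, N)
    while gcd(step, N) != 1:
        step = random.randrange(1, N)
    offset = random.randrange(N)
    for i in range(N):
        yield (offset + i * step) % N

print(list(pseudo_random_range(10)))
```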

Lior

2 Answers

import functools
import itertools
import random
from collections import deque
from bloom_filter import BloomFilter  # third-party package: pip install bloom-filter

def random_no_repeat(random_func, limit):
    # Bloom filters are probabilistic: a false positive makes a number look
    # "already returned", so a few numbers may be skipped entirely (and the
    # loop may then never reach limit).
    already_returned = BloomFilter(max_elements=limit)
    count = 0
    while count < limit:
        i = random_func()
        if i not in already_returned:
            count += 1
            already_returned.add(i)
            yield i

def count_iter_items(iterable):
    counter = itertools.count()
    deque(itertools.zip_longest(iterable, counter), maxlen=0)  # consume at C speed
    return next(counter)

N = 100_000
random.seed(0)
random_gen = random_no_repeat(functools.partial(random.randint, 0, N - 1), N)

for index, i in enumerate(random_gen):
    print(index, i)
Hadi Farah
  • Good answer. But there is no guarantee here that all integers in the range are generated. Am I right? – Lior Dec 20 '18 at 05:57

I am not so sure about the exact space and time requirements, but this should need far less than N!: by moving the limits low and high inward and only storing the set of "inner" numbers already seen, it should also not take overly long towards the end to draw a fresh number, unlike when you simply brute-force over range(N) and check if in seen:

import random

def random_range(N):
    seen = set()
    low = 0
    high = N
    while low < high:
        k = random.randrange(low, high)
        if k in seen:
            # already drawn - try again
            continue
        yield k
        seen.add(k)

        # advance the lower limit past already-seen numbers
        while low in seen:
            seen.remove(low)
            low += 1

        # pull the upper limit past already-seen numbers
        while high - 1 in seen:
            seen.remove(high - 1)
            high -= 1

for i in random_range(20):
    print(i, end=", ")

Output:

7, 2, 5, 18, 11, 3, 6, 10, 14, 9, 15, 17, 19, 0, 16, 4, 1, 12, 13, 8,

If you plug in N = 2^63, the seen set will grow huge before it shrinks down again, because the probability of hitting the low or high boundary is small - that is what drives the peak memory consumption.

The runtime also gets worse the fuller seen is relative to range(low, high), because it might take thousands of continues to hit a random number that is not in seen already:

# pseudo 
seen = { 1-99999,100001-99999999999 } 
low = 0
high = 99999999999+2

This state would not be "reducible", and there are only 3 numbers left to draw from range(0, 99999999999+2) - but the chance of ever reaching such a state is also rather tiny.
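To make the memory behaviour concrete, here is an instrumented variant of the generator above (a sketch) that also records the peak size reached by seen:

```python
import random

def random_range_measured(N):
    # Same algorithm as random_range above, but collects the results
    # and tracks the largest size the seen set ever reaches.
    seen = set()
    low, high = 0, N
    peak = 0
    out = []
    while low < high:
        k = random.randrange(low, high)
        if k in seen:
            continue
        seen.add(k)
        peak = max(peak, len(seen))
        out.append(k)
        while low in seen:
            seen.remove(low)
            low += 1
        while high - 1 in seen:
            seen.remove(high - 1)
            high -= 1
    return out, peak

random.seed(0)
values, peak = random_range_measured(10_000)
print(len(values), peak)
```

On typical runs the peak lands close to N, which matches the O(N) worst-case memory pointed out in the comments below.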

Your choice ;o)

Patrick Artner
  • right, this approach takes O(N) memory (at least in the worst case) which is indeed much less than O(N!). However, there is a chance here of generating the same integer twice. Also, following this answer I edited the question since I realize now that I would actually like something smaller than O(N)... – Lior Dec 20 '18 at 06:07
  • I mean, smaller than ~N – Lior Dec 20 '18 at 06:13
  • @Lior - where do you see the chance of an integer to be reported more than once? – Patrick Artner Dec 20 '18 at 06:47
  • you are right, my mistake. There is no chance for an integer being yielded twice. – Lior Dec 22 '18 at 10:59