
How do I use random.shuffle() on a generator without initializing a list from the generator? Is that even possible? If not, how else should I use random.shuffle() on my list?

>>> import random
>>> random.seed(2)
>>> x = [1,2,3,4,5,6,7,8,9]
>>> def yielding(ls):
...     for i in ls:
...             yield i
... 
>>> for i in random.shuffle(yielding(x)):
...     print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/random.py", line 287, in shuffle
    for i in reversed(xrange(1, len(x))):
TypeError: object of type 'generator' has no len()

Note: random.seed() is used so that the script produces the same output on each run.

alvas
  • that does not really make sense, as the point of a generator is that you don't know what the elements are and can't access them except in an orderly fashion – njzk2 Jan 17 '14 at 13:25
  • because the seed is supposed to be customized, so in this case: `n=2; random.seed(2)`. Sometimes the random seed could be another number. – alvas Jan 17 '14 at 13:26
  • Can't imagine any canonical method to shuffle a sequence of unknown length. And note, that `random.shuffle` shuffles *in place*. – alko Jan 17 '14 at 13:28
  • Instead of a whole generator function, you could have used `iter(x)`. – Martijn Pieters Jan 17 '14 at 13:31
  • I would suggest using a poisson distribution for a positive random look-ahead. Then (lazily or not) ignore that element from the iterated object, then repeat. – mnish May 27 '18 at 05:40
  • How can you put the rest of the days of your life in a random order? How can you choose a random day from the rest of your life? You'd have to know how long you're going to live, right? – Karl Knechtel Sep 17 '22 at 12:41

7 Answers


In order to shuffle the sequence uniformly, random.shuffle() needs to know how long the input is. A generator cannot provide this; you have to materialize it into a list:

lst = list(yielding(x))
random.shuffle(lst)
for i in lst:
    print i

You could, instead, use sorted() with random.random() as the key:

for i in sorted(yielding(x), key=lambda k: random.random()):
    print(i)

but since this also produces a list, there is little point in going this route.

Demo:

>>> import random
>>> x = [1,2,3,4,5,6,7,8,9]
>>> sorted(iter(x), key=lambda k: random.random())
[9, 7, 3, 2, 5, 4, 6, 1, 8]
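As an aside, if a shuffled copy (rather than an in-place shuffle) is acceptable, `random.sample()` with the full length does the materialize-and-shuffle in one expression; a small sketch:

```python
import random

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def yielding(ls):
    for i in ls:
        yield i

# random.sample() needs a sequence, so the generator still has to be materialized
pool = list(yielding(x))
shuffled = random.sample(pool, k=len(pool))
```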
Martijn Pieters
  • But this could produce duplicates? – thefourtheye Jan 17 '14 at 13:27
  • @thefourtheye: No. It might assign two elements the same "weight" but it won't duplicate the elements themselves. – Aaron Digulla Jan 17 '14 at 13:28
  • @thefourtheye: no, it just sorts the output of the `yielding(x)` generator using random values. – Martijn Pieters Jan 17 '14 at 13:29
  • I would have expected `sorted` to rely on the key function being deterministic (if not, there is probably a caching mechanism (which, on second thought, makes sense, given that `sorted` does not know the complexity of the key function)) – njzk2 Jan 17 '14 at 13:30
  • The first thing `sorted()` does is storing all elements in the generator in a list, before even starting to compute the keys and sorting it. – Sven Marnach Jan 17 '14 at 13:31
  • Also, regarding the keyword part, it does make sense, as the second argument to sorted is cmp, which takes 2 arguments (although `lambda x, y: random.choice([-1, 0, 1])` would work too) (or would it, or is there a risk of infinite loop?) – njzk2 Jan 17 '14 at 13:32
  • @njzk2: no, the key function is called only once for each value. – Martijn Pieters Jan 17 '14 at 13:32
  • @njzk2: `sorted()` calls the key function exactly once on each input element, in the order they are provided. – Sven Marnach Jan 17 '14 at 13:32
  • @njzk2: the key function is used in a *decorate-sort-undecorate* style sort; sorting a list of `(keyfunc(value), None, value)` values, then extracting `value` from that again. – Martijn Pieters Jan 17 '14 at 13:33
  • @njzk2: This would work in some way, but as a result you would favour certain permutations over others. – Sven Marnach Jan 17 '14 at 13:33
  • @njzk2: the `cmp` argument would normally have to be stable, but inconsistent results lead to random sorting instead of an infinite loop. I am not certain if the distribution will remain uniform in that case or if a bias is introduced, though. – Martijn Pieters Jan 17 '14 at 13:36
  • @MartijnPieters : isn't there a probability that the sort is never considered stable if cmp is not stable? (I just realized while writing this that no, most sorts do not re-verify elements once considered sorted.). – njzk2 Jan 17 '14 at 13:39
  • Quick question: Would `iter(sorted(iter(x), key=lambda k: random.random()))` first materialize the sorted list before casting it into an iterable? – alvas Jun 13 '18 at 08:34
  • @alvas: `sorted()` returns a list, always. `iter()` then produces an iterator for that list. Python doesn't alter how `sorted()` works based on what you use the returned object for. – Martijn Pieters Jun 13 '18 at 17:35
  • [Wikipedia](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Sorting) covers the flaws in the methods discussed by @njzk2 and the OP. `key=random.random()` probably represents a 52-bit random number, which is probably fine. `cmp=random.randrange(2)` not so good. Meanwhile, I've been severely nerd-sniped trying to dig up the reference and getting derailed by other trivia on my way there. – sh1 Oct 22 '22 at 05:37
  • @sh1: The default implementation returns [*multiples of 2⁻⁵³ in the range 0.0 ≤ x < 1.0*](https://docs.python.org/3/library/random.html#recipes). njzk2 was referring to the original Python 2 method of sorting, using a [`cmp()` function](https://docs.python.org/2/library/functions.html#cmp) which takes 2 arguments. You can't supply `random.randrange(2)` to that ;-) – Martijn Pieters Nov 25 '22 at 14:48
  • Ok, a lambda of the same effect, then. Wikipedia discusses in detail why it's a bad idea. – sh1 Nov 26 '22 at 16:27

It's not possible to randomize the yield of a generator without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:

tmp = list(yielding(x))
random.shuffle(tmp)
for i in tmp:
    print i

Note the call to list() which will read all items and put them into a list.

If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.
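A sketch of that last suggestion, assuming the underlying sequence supports `len()` and indexing (the `randomize` parameter is illustrative, not from the original answer):

```python
import random

def yielding(ls, randomize=False):
    # shuffle an index list up front, then yield elements in that order;
    # this only works because ls itself supports len() and indexing
    order = list(range(len(ls)))
    if randomize:
        random.shuffle(order)
    for i in order:
        yield ls[i]
```

`list(yielding([1, 2, 3], randomize=True))` then produces the elements in a random order without a second pass over the data.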

Aaron Digulla
  • +1 for "change the generator to yield in a random order." This is what I did when I had OP's problem. You can add an optional bool param: `def yielding(ls, randomize=False)`. – Taylor Vance Jul 06 '22 at 16:07

Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
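For instance (a hypothetical sketch; `records` stands in for whatever indexable data you have), shuffling an index list keeps the reads themselves lazy:

```python
import random

records = ["record-%d" % i for i in range(10)]  # hypothetical indexable data

order = list(range(len(records)))
random.shuffle(order)

# each record is only touched when the generator is advanced
shuffled_records = (records[i] for i in order)
```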

Alternatively... If you need to use the generator...

it depends on 'how shuffled' you want the data. Of course, like folks have pointed out, generators don't have a length, so you need to at some point evaluate the generator, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:

from itertools import islice

import numpy as np


def shuffle(generator, buffer_size):
    iterator = iter(generator)  # ensure islice consumes the input instead of restarting it
    while True:
        buffer = list(islice(iterator, buffer_size))
        if len(buffer) == 0:
            break
        np.random.shuffle(buffer)
        for item in buffer:
            yield item


shuffled_generator = shuffle(my_generator, 256)

This will shuffle data in chunks of buffer_size, so you can avoid memory issues if that is your limiting factor. Of course, this is not a truly random shuffle, so it shouldn't be used on something that's sorted, but if you just need to add some randomness to your data this may be a good solution.
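For reference, the same buffer idea can be sketched with just the standard library (no numpy); note that elements are only ever shuffled within their own chunk:

```python
import random
from itertools import islice

def buffered_shuffle(iterable, buffer_size):
    it = iter(iterable)
    while True:
        buffer = list(islice(it, buffer_size))
        if not buffer:
            return
        random.shuffle(buffer)
        for item in buffer:
            yield item

# every element stays inside its original chunk of buffer_size
out = list(buffered_shuffle(range(10), 4))
```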

sturgemeister
  • I took your answer and improved it a bit; see the alternative below, streamed not chunked. That way elements can migrate far beyond the buffer_size window. – Erik Aronesty Aug 28 '19 at 18:37

You could sample from arbitrary yielded results, generating a not fully random but somewhat shuffled set within a range. Similar to @sturgemeister's code above, but not chunked: there are no fixed randomness boundaries.

For example:

import random

def scramble(gen, buffer_size):
    buf = []
    i = iter(gen)
    while True:
        try:
            e = next(i)
            buf.append(e)
            if len(buf) >= buffer_size:
                choice = random.randint(0, len(buf)-1)
                buf[-1],buf[choice] = buf[choice],buf[-1]
                yield buf.pop()
        except StopIteration:
            random.shuffle(buf)
            yield from buf
            return

The results should be fully random within the buffer_size window:

import itertools

for e in scramble(itertools.count(start=0, step=1), 1000):
    print(e)

For an arbitrary 1000 elements in this stream, they seem random. But looking at the overall trend (beyond 1000), it's clearly increasing.

To test, assert that this returns 1000 unique elements:

seen = list(scramble(range(1000), 100))
assert sorted(seen) == list(range(1000))
Erik Aronesty
  • Nice implementation and decent performance, in my tests around 10X slower than iterating over a standard list() object. I also wasn't aware of the functionality of `yield from`, interesting! – Scholar Oct 08 '20 at 08:27

I needed to find a solution to this problem so I could get expensive-to-compute elements in a shuffled order, without wasting computation by generating unused values. This is what I have come up with for your example. It involves making another function to index the first array.

You will need numpy installed

pip install numpy

The Code:

import numpy as np
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def shuffle_generator(lst):
    return (lst[idx] for idx in np.random.permutation(len(lst)))

def yielding(ls):
    for i in ls:
        yield i

# for i in random.shuffle(yielding(x)):
#    print i

for i in yielding(shuffle_generator(x)):
    print(i)
beiller

For very large sequences, if you know the sequence size in advance:

from random import random

class subset_iterator:
    """
    an iterator class that returns K random samples from another sequence
    that has no random-access. Requires: the sequence length as input

    similar to random.sample

    :param it: iterator to the sequence
    :param seqlen: size of the sequence of :param it:
    :param K: output sequence size (number of samples in the subset)
    """

    def __init__(self, it, seqlen, K):
        self.it = it
        self.N = seqlen
        self.K = K

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            if self.K <= 0:
                # all K samples have been drawn; stop without scanning the tail
                raise StopIteration
            r = random()
            nextitem = next(self.it)
            if r <= float(self.K) / self.N:
                self.K -= 1
                self.N -= 1
                return nextitem
            else:
                self.N -= 1
shacharf

A generator follows a sequential access pattern. Shuffled data requires the exact opposite, a random access pattern.

In many applications, we can get away with local perturbations only, which relaxes the problem quite a lot.

Here is an example of an in-memory shuffle buffer.

from random import randint, shuffle

domain = (0, 1000)
stream = iter(range(*domain))  # stand-in for any generator
buffer = [next(stream) for _ in range(50)]  # prime the buffer from the stream itself

for element in stream:
    idx = randint(0, len(buffer) - 1)
    # swap the incoming element into the buffer and emit the one it displaces
    element, buffer[idx] = buffer[idx], element
    print(element)

# flush the elements still held in the buffer
shuffle(buffer)
for element in buffer:
    print(element)
0-_-0