6

Python question. I'm generating a large array of objects, which I only need to make a small random sample. Actually generating the objects in question takes a while, so I wonder if it would be possible to somehow skip over those objects that don't need generating and only explicitly create those objects that have been sampled.

In other words, I now have

a = createHugeArray()
s = random.sample(a,len(a)*0.001)

which is rather wasteful. I would prefer something more lazy like

a = createArrayGenerator()
s = random.sample(a,len(a)*0.001)

I don't know if this works. The documentation on random.sample isn't too clear, though it mentions xrange as being very fast - which makes me believe it might work. Converting the array creation to a generator would be a bit of work (my knowledge of generators is very rusty), so I want to know if this works in advance. :)

An alternative I can see is to make a random sample via xrange, and only generate those objects that are actually selected by index. That's not very clean though, because the indices generated are arbitrary and unnecessary, and I would need rather hacky logic to support this in my generateHugeArray method.

For bonus points: how does random.sample actually work? Especially, how does it work if it doesn't know the size of the population in advance, as with generators like xrange?

Verhoevenv
  • 147
  • 1
  • 4
  • What kind of data does generateHugeArray() create? – Rafe Kettler Nov 26 '10 at 16:17
  • Currently, a plain old list of objects, with all possible permutations of a certain state space of which I only need a small sample.It's for a game of Quantum Werewolf in particular, see https://github.com/verhoevenv/bra-ket-wolf/blob/master/multiverse.py#L87 where I create all Universes first and then create filteredUniverses. – Verhoevenv Nov 26 '10 at 16:26

4 Answers4

2

There does not seem a way that avoids figuring out how the indices map to your permutations. If you don't know this, how would you create a random object from your array? You could either use the trick using xrange() you suggested yourself, or implement a class defining the __getitem__() and __len__() methods and pass and object of this class as population argument to random.sample().

Some further comments:

  • Converting createHugeArray() into a generator won't buy you anything -- random.sample() will just not work anymore. It needs an object supporting len().

  • So it does need to know the number of elements in the population right from the beginning.

  • The implementation features two different algorithms and chooses the one that will use less memory. For relatively small k (that is, in the case at hand) it will simply save the indices already chosen in a set and make a new random choice if it hits one of them.

Edit: A completely different approach would be to iterate over all permutations once and decide for each permutation if it should be included. If the total number of permutations is n and you would like to select k of them, you could write

selected = []
for i in xrange(n):
    perm = nextPermutation()
    if random.random() < float(k-len(selected))/(n-i):
        selected.append(perm)

This would choose exactly k permutations randomly.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • Ah. Interesting. In my particular case, __len__ should be possible with some combinatorics, __getitem__ will need some thought though. Thanks! – Verhoevenv Nov 26 '10 at 17:04
  • 1
    On second thought: if I use the xrange() trick, then sort the sampled indices, I only have to loop through all permutations exactly once, much simplifying the whole thing. Also, boo to Markdown and Python methods. :) – Verhoevenv Nov 26 '10 at 17:13
  • Your last algorithm assumes that `permutation` has `__len__()` (`n` is known) and `__getitem__()` (trivial to implement in terms of `permutation(i)`). Why might one use it instead of `random.sample()`? – jfs Nov 26 '10 at 19:34
  • @J.F. Sebastian: The idea is to iterate over all the permutations and only store the ones that are selected. The text says this, while the code does not make this really clear. The advantage is that it is not necessary to store the whole array in memory. The advantage over the algorithm suggested by Verhoevenv in the above comment is that it is simpler. – Sven Marnach Nov 26 '10 at 20:57
0

You could create a list of array indexes with sample and then generate the objects according to the results:

def get_object(index):
    return MyClass(index)

or something like this. Then use sample to generate the indexes you need and call this function with those indexes:

objs = map(get_object, random.sample(range(length), 0.001 * length))

This is a little indirect as in it only chooses from a list of possible array indexes.

terminus
  • 13,745
  • 8
  • 34
  • 37
  • That's about what I meant with my alternative, yes, but your explanation is probably clearer. :) Like I said, not a very Pythonic solution IMO, but if nothing else comes up I guess it's okay. – Verhoevenv Nov 26 '10 at 16:42
0

Explaining how random.sample works,

random.sample(container, k) will return k number of values randomly from the container. Because a generator is iterable like lists, tuples and the keys or values in dicts it will iterate through the container and then take these random elements.

e.g. random.sample(xrange(111),4) will return something like [33,52,111,1] as k = 4 meaning 4 random numbers from the xrange generator upto 111.

Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
  • Yeah, I understand the documentation, but I don't see what kind of algorithm they are using for it. Guess I can just take a look in the source though. – Verhoevenv Nov 26 '10 at 16:40
0

I'm guessing the function createHugeArray() contains a piece of code that is repeated once for each object that is created. And I'm guessing the objects are generated from some sort of initial value or seed, in which case createHugeArray() looks something like this:

def createHugeArray( list_of_seeds ):
  huge_array = []                  
  for i in list_of_seeds:
    my_object = makeObject( i )
    huge_array.append( my_object )           
  return huge_array

(I used lists not arrays, but you get the idea.)

To do the random sampling before actually creating the objects, just add a line that generates a random number, and then only create the object if the random number is below a certain threshold. Say you only want one object in a thousand. random.randint(0,999) gives a number from 0 to 999 - so only generate an object if you get zero. The code above becomes:

import random

def createHugeArray( list_of_seeds ):
  huge_array = [] 

  for i in list_of_seeds:
    die_roll = random.randint(0,999)

    if( die_roll == 0 ):
      my_object = makeObject( i )
      huge_array.append( my_object ) 
  return huge_array

Of course if my guess about how your code works is wrong then this is useless to you, in which case sorry and good luck :-)

user416625
  • 213
  • 1
  • 4
  • 10