
I want to split m*n elements (e.g., 1, 2, ..., m*n) into n groups randomly and evenly, so that each group has m random elements. Each group processes k (k>=1) elements at a time from its own group, at the same speed as the others (via some synchronization mechanism), until all groups have processed all their elements. Each group runs in an independent process/thread.

I use numpy.random.choice(m*n, m*n, replace=False) to generate the permutation first, and then index into the permuted result for each group.
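For reference, the approach just described can be sketched like this (sizes shrunk for illustration; the real m*n may be on the order of 1e8):

```python
import numpy as np

m, n = 4, 3  # illustrative sizes only
perm = np.random.choice(m * n, m * n, replace=False)  # full permutation up front
groups = perm.reshape(n, m)  # row i holds the m elements assigned to group i
```

Note that `np.random.permutation(m * n)` produces the same kind of result and is typically at least as fast as `choice` with `replace=False`, though both still materialize the whole permutation eagerly.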

The problem is that when m*n is very large (e.g., >=1e8), this is very slow (tens of seconds to minutes).

Is there any faster/lazier way to do this? I think it could be done lazily: instead of generating the full permuted result up front, build a generator, and have each group draw k elements from it at a time, with the same statistical effect as the method I currently use. But I don't know how to achieve this lazy approach, and I am not sure whether it can actually be implemented.

Daniel
    Your goal appears to be to generate a permutation of N items by multiple threads in parallel. The following may point you in the right direction: https://github.com/lorenzhs/sampling . Also, generating a permutation is equivalent to generating N exponential variates and sorting them (https://arxiv.org/pdf/1903.00227.pdf). If that helped you find an answer, you can post it. – Peter O. Apr 15 '21 at 11:31
  • @PeterO. Thanks! It seems promising! I will have a try first. – Daniel Apr 15 '21 at 13:25
  • Did you find a solution? If so you should post that solution as an answer. – Peter O. May 04 '21 at 15:21
  • @PeterO. I haven't found a satisfying solution, but I compromised and implemented a sequence server that generates one number at a time using the Fisher-Yates algorithm, and puts the generated numbers into `n` queues for the `n` processes to get from. – Daniel May 06 '21 at 05:17
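The sequence-server idea mentioned in the comment above could be sketched as follows. This is a single-process illustration with `queue.Queue` standing in for the inter-process queues, and all names are illustrative; it uses a sparse Fisher-Yates shuffle, which stores only displaced slots in a dict instead of materializing the full list:

```python
import random
from queue import Queue

def sequence_server(total, n, queues):
    """Lazily emit a uniform random permutation of range(total),
    distributing the numbers round-robin into n queues."""
    slots = {}                         # only displaced slots are stored
    for count, p in enumerate(range(total - 1, -1, -1)):
        i = random.randrange(p + 1)    # random slot among the remaining ones
        v = slots.get(i, i)            # value currently sitting at slot i
        slots[i] = slots.get(p, p)     # move slot p's value into slot i
        slots.pop(p, None)             # slot p is finished; free its entry
        queues[count % n].put(v)
    for q in queues:
        q.put(None)                    # sentinel: no more numbers
```

Each consumer process would then loop on `q.get()` until it sees the `None` sentinel.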

1 Answer


You can make a generator that will progressively shuffle (a copy of) the list and lazily yield distinct groups:

import random
def rndGroups(A,size):
    A = A.copy()                    # work on a copy (if needed)
    p = len(A)                      # target position of random item
    for _ in range(0,len(A),size):  # work in chunks of group size
        for _ in range(size):       # Create one group 
            i = random.randrange(p) # random index in remaining items
            p -= 1                  # update randomized position
            A[i],A[p] = A[p],A[i]   # swap items
        yield A[p:p+size]           # return shuffled sub-range

Output:

A  = list(range(100))
iG = rndGroups(A,10)       # generator of 10 groups of 10 items
s  = set()                 # set to validate uniqueness
for _ in range(10):  # 10 groups
    g = next(iG)     # get the next group from generator
    s.update(g)      # to check that all items are distinct
    print(g)
print(len(s))        # must get 100 distinct values from groups

[87, 19, 85, 90, 35, 55, 86, 58, 96, 68]
[38, 92, 93, 78, 39, 62, 43, 20, 66, 44]
[34, 75, 72, 50, 42, 52, 60, 81, 80, 41]
[13, 14, 83, 28, 53, 5, 94, 67, 79, 95]
[9, 33, 0, 76, 4, 23, 2, 3, 32, 65]
[61, 24, 31, 77, 36, 40, 47, 49, 7, 97]
[63, 15, 29, 25, 11, 82, 71, 89, 91, 30]
[12, 22, 99, 37, 73, 69, 45, 1, 88, 51]
[74, 70, 98, 26, 59, 6, 64, 46, 27, 21]
[48, 17, 18, 8, 54, 10, 57, 84, 16, 56]
100

This will take just as long overall as pre-shuffling the list (if not longer), but it lets you start/feed threads as soon as the first group is ready, thus increasing parallelism.
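For example, the generator can feed worker threads as each group becomes available; the `worker` function and `k` below are placeholders for the real per-group processing:

```python
import random
import threading

def rnd_groups(A, size):
    # same lazy partial Fisher-Yates shuffle as the answer's rndGroups
    A = A.copy()
    p = len(A)
    for _ in range(0, len(A), size):
        for _ in range(size):
            i = random.randrange(p)
            p -= 1
            A[i], A[p] = A[p], A[i]
        yield A[p:p + size]

results = []
lock = threading.Lock()

def worker(group, k=2):
    # placeholder processing: consume k elements at a time
    for j in range(0, len(group), k):
        chunk = group[j:j + k]
        with lock:
            results.extend(chunk)   # stand-in for real work on `chunk`

threads = []
for group in rnd_groups(list(range(100)), 10):
    t = threading.Thread(target=worker, args=(group,))
    t.start()                       # starts before later groups exist
    threads.append(t)
for t in threads:
    t.join()
```

With real workloads you would replace `results.extend(chunk)` with the actual per-chunk work, and add whatever synchronization keeps the groups in lockstep.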

Alain T.