
I have a number of objects (roughly 530,000). These objects are randomly assigned to a set of lists (not actually random but let's assume it is). These lists are indexed consecutively and assigned to a dictionary, called groups, according to their index. I know the total number of objects but I do not know the length of each list ahead of time (which in this particular case happens to vary between 1 and 36000).

Next I have to process each object contained in the lists. In order to speed up this operation I am using MPI to send them to different processes. The naive way to do this is to simply assign each process len(groups)/size lists (where size is the number of processes used), distribute any remainder, have each process work through the objects it was given, return the results and wait. This obviously means, however, that if one process gets, say, a lot of very short lists and another gets all the very long lists, the first process will sit idle most of the time and the performance gain will not be very large.
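For illustration, a minimal sketch of this naive split (assuming `groups` is the dictionary described above and `size` the number of processes; `indices` and `chunks` are made-up names for the example):

# Naive split: each process gets len(groups) // size lists, with the
# remainder spread over the first few processes.
indices = sorted(groups)                                 # consecutive list indices
per_proc, remainder = divmod(len(indices), size)
chunks = []
start = 0
for rank in range(size):
    stop = start + per_proc + (1 if rank < remainder else 0)
    chunks.append([groups[i] for i in indices[start:stop]])
    start = stop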

What would be the most efficient way to assign the lists? One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible. But I am not sure how to best implement this. Does anybody have any suggestions?


2 Answers


One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible.

Assuming that processing time scales exactly with the sum of list lengths and your processors are homogeneous, this is in fact what you want. It is known as the multiprocessor scheduling problem, which is closely related to the bin packing problem, but with a fixed number of bins and the goal of minimizing the maximum bin load.

In general this is an NP-hard problem, so you will not get a perfect solution. The simplest reasonable approach is to greedily give the largest remaining chunk of work to the processor that has the least work assigned to it so far.

It is trivial to implement this in Python (the example uses a list of lists; with the dictionary from the question you would iterate over `groups.values()` instead):

import numpy as np

# `groups` is an iterable of lists, `nprocs` the number of processes.
greedy = [[] for _ in range(nprocs)]
# Largest-first greedy: hand each list, longest first, to the process
# with the least total work assigned so far.
for group in sorted(groups, key=len, reverse=True):
    smallest_index = np.argmin([sum(map(len, assignment)) for assignment in greedy])
    greedy[smallest_index].append(group)
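
Back in the MPI setting, the root process could compute this assignment and then distribute it, e.g. with `mpi4py` (a sketch, assuming `nprocs` equals the communicator size):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# Root computes `greedy` as above; every rank receives its own share.
my_groups = comm.scatter(greedy if comm.rank == 0 else None, root=0)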

If you have a large number of processors, you may want to optimize the `smallest_index` computation by using a priority queue (a sketch follows below the plot). This will produce significantly better results than the naive sorted split recommended by Attersson:

[Plot: resulting imbalance of the three implementations]

(https://gist.github.com/Zulan/cef67fa436acd8edc5e5636482a239f8)
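
For reference, a minimal sketch of the priority-queue variant (same placeholder names `groups` and `nprocs` as above); `heapq` always keeps the process with the least assigned work at the front, so each assignment costs O(log P) instead of a linear scan:

import heapq

# Heap entries are (total assigned length, process index); the smallest
# total is always at the front.
heap = [(0, rank) for rank in range(nprocs)]
assignments = [[] for _ in range(nprocs)]
for group in sorted(groups, key=len, reverse=True):
    total, rank = heapq.heappop(heap)        # process with the least work so far
    assignments[rank].append(group)
    heapq.heappush(heap, (total + len(group), rank))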

  • Thank you, I was having a similar thought this morning, though didn't get round to playing with it yet. Could you quickly expand on what you mean by "optimise the `smallest_index` computation by using a priority queue"? – P-M Jul 05 '18 at 11:01
  • If you do not want to search the entire set of processors for the one with the least work, you must use and maintain a list of processor assignments sorted by their current work, so you can find the one with the least work in `O(log P)`. Python isn't so great with sorted data structures, so you could go with something from [sortedcollections](https://pypi.org/project/sortedcollections/). – Zulan Jul 05 '18 at 13:41
  • 1
    (In amendment of the previous comment) Great solution, I got it! Thanks for your code and the comparison between the three algorithms. – Attersson Jul 05 '18 at 19:22

On the assumption that a longer list has a larger memory size, the memory size of `your_list` can be retrieved with the following code:

import sys
sys.getsizeof(your_list)

(Note: this depends on the Python implementation. Please read How many bytes per element are there in a Python list (tuple)?)

There are several ways you can proceed then. If your original "pipeline" of lists can be sorted by `key=sys.getsizeof`, you can then slice and assign every Nth element to process N (Pythonic way to return list of every nth item in a larger list).
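
For instance, the sort itself might look like this (a sketch; `pipeline` is an assumed name for your original collection of lists):

import sys

# Order the lists by their (approximate) memory footprint.
sorted_pipeline = sorted(pipeline, key=sys.getsizeof)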

Example:

sorted_pipeline = [list1,list2,list3,.......]
sorted_pipeline[0::10] # every 10th item, assign to the first sub-process of 10

This balances the load in a fair manner while keeping the complexity at O(N log N) for the initial sort, plus constant (or linear, if the lists are copied) work to assign the lists.

Illustration (as requested) of splitting 10 elements into 3 groups:

>>> my_list = [0,1,2,3,4,5,6,7,8,9]
>>> my_list[0::3]
[0, 3, 6, 9]
>>> my_list[1::3]
[1, 4, 7]
>>> my_list[2::3]
[2, 5, 8]

And the final solution:

assigned_groups = {}
for i in range(size):                        # every size-th list, starting at offset i
    assigned_groups[i] = sorted_pipeline[i::size]

If this is not possible, you can always keep a counter of the total queue size per sub-process pipeline and tweak the probability or selection logic to take that into account.
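
A minimal sketch of that fallback, assuming the lists arrive one at a time (`incoming_lists`, `totals` and `assigned` are invented names): keep a running total per sub-process and always hand the next list to the one with the smallest total:

totals = [0] * size
assigned = [[] for _ in range(size)]
for lst in incoming_lists:                   # lists as they become available
    target = min(range(size), key=totals.__getitem__)
    assigned[target].append(lst)
    totals[target] += len(lst)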
