
I'm trying to redistribute an array (mesh-like) over a set of processes for load-balancing purposes. My special requirement is that array elements should only be moved to spatially adjacent processes, since only the elements near the front between processes can be moved easily.

[Figure: the example partition of the mesh over the four processes 0-3]

In the example setup above, the first three processes should all donate elements to the last one (a sketch of how these signed counts could be computed follows this listing):

# Process, Neighbors, Number of elements to move in/out (+ = receive, - = donate)
0, (1 2), -23
1, (0 3), -32
2, (0 3), -13
3, (1 2), +68
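
A minimal sketch, in C with MPI, of how each rank could compute that signed count, assuming the target is simply the global average number of elements per process (all names here are mine, not from the original setup):

#include <mpi.h>

/* Returns the signed imbalance of this rank:
 * positive  -> this rank should receive that many elements,
 * negative  -> this rank should donate that many elements. */
long imbalance(long n_local, MPI_Comm comm)
{
    int  size;
    long n_total;

    MPI_Comm_size(comm, &size);
    MPI_Allreduce(&n_local, &n_total, 1, MPI_LONG, MPI_SUM, comm);

    /* Target is the global average element count (integer division,
     * so the remainder has to be handled separately). */
    return n_total / size - n_local;
}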

Currently, I'm planning to implement this with blocking two-way MPI communication, where transactions happen similar to the following (see the sketch after this list):

P0 sends 23 elements to P1
P1 sends 55 elements to P3      (Should only send 32 originally, + the 23 it got from P0)
P2 sends 13 elements to P3
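
A minimal sketch of one such blocking two-way exchange between a pair of neighboring ranks, using MPI_Sendrecv so the two directions cannot deadlock; elem_t, exchange_with_neighbor and the use of MPI_DOUBLE are assumptions, not part of the original setup:

#include <mpi.h>
#include <stdlib.h>

typedef double elem_t;   /* stand-in for the real element type */

/* Send n_send elements to nbr_rank, receive whatever that neighbor sends
 * back, and return the number of elements received. */
int exchange_with_neighbor(elem_t *send_buf, int n_send,
                           elem_t **recv_buf, int nbr_rank, MPI_Comm comm)
{
    int n_recv = 0;

    /* First agree on the counts so both sides can size their buffers. */
    MPI_Sendrecv(&n_send, 1, MPI_INT, nbr_rank, 0,
                 &n_recv, 1, MPI_INT, nbr_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    *recv_buf = malloc((n_recv > 0 ? n_recv : 1) * sizeof(elem_t));

    /* Then swap the payloads in one combined call, which avoids the
     * deadlock a naive blocking MPI_Send/MPI_Recv pairing could cause. */
    MPI_Sendrecv(send_buf, n_send, MPI_DOUBLE, nbr_rank, 1,
                 *recv_buf, n_recv, MPI_DOUBLE, nbr_rank, 1,
                 comm, MPI_STATUS_IGNORE);

    return n_recv;
}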

So I was wondering if there is a known algorithm (preferably one that is easily parallelized through two-way MPI communication) that deals with this kind of situation.

I've also thought about "flattening out" the processes and treating them as if they form a simple ring. This simplifies things, but it has the potential of being noisy and may not scale well (a sketch follows this list):

P0 sends 23 elements to P1
P1 sends 55 elements to P2    (Even though P2 is not one of its spatial neighbors)
P2 sends 68 elements to P3
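
For comparison, a sketch of the flattened-ring shift; the rank order is taken as the (virtual) ring order, the buffers and counts are placeholders, and each call moves a payload one step around the ring:

#include <mpi.h>

/* Shift elements one step around the ring: send to rank+1, receive from
 * rank-1. recv_capacity is the size of recv_buf in elements. */
void ring_shift(double *send_buf, int n_send,
                double *recv_buf, int recv_capacity,
                int *n_recv_out, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;          /* "right" neighbor on the ring */
    int prev = (rank - 1 + size) % size;   /* "left" neighbor on the ring  */

    MPI_Status status;
    MPI_Sendrecv(send_buf, n_send, MPI_DOUBLE, next, 0,
                 recv_buf, recv_capacity, MPI_DOUBLE, prev, 0,
                 comm, &status);

    /* How many elements actually arrived from the previous rank. */
    MPI_Get_count(&status, MPI_DOUBLE, n_recv_out);
}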

Can the Metis/ParMETIS library handle this?

Elwardi
  • Is this going to run on a single system with shared memory, or distributed? Because in the latter case you can only use MPI methods. – Goswin von Brederlow Mar 13 '22 at 12:12
  • @GoswinvonBrederlow Possibly multiple systems, distributed, but I also have to take care of the front between processes which I want to keep minimal – Elwardi Mar 13 '22 at 12:14
  • If it is distributed, the major cost will be copying over the network. I don't think you have to worry about anything but a) deciding early that you should send elements, so they can be transmitted before the computation grinds to a halt, and b) carefully calculating amounts so you never send elements back to the originator (either directly back or all the way around). – Goswin von Brederlow Mar 13 '22 at 12:20
  • If you flatten it out you have a ring and every process has a left and right. Makes it easy to reason and implement. But what if you have a 3x3 grid? The "ring" would have to go diagonal at one point. Does that make sense in your model? Or do you only allow grids with even dimensions so there is a path only going to adjacent nodes? Or do larger grids share elements up/down/left/right? – Goswin von Brederlow Mar 13 '22 at 12:23
  • @GoswinvonBrederlow If I flatten it out, no matter what the spatial configuration of the elements held by processes, there will be always one previous and one next process, and that next process is not guaranteed to be a neighbor. Maybe I should consider making that a requirement too – Elwardi Mar 13 '22 at 12:27
  • If processes can only process spatially adjacent elements then you should. Having to send elements from A to B to C to D to ... until they hit the right process would be bad. – Goswin von Brederlow Mar 13 '22 at 12:29
  • Then the problem boils down to a proper (virtual) ordering of processes, but I'm not quite sure the last and first processes will play nice with your point b) – Elwardi Mar 13 '22 at 12:29
  • If it is a ring then it has no first and last. You connect the two to make a ring. – Goswin von Brederlow Mar 13 '22 at 12:30
  • Well, then I'll have to split the processes into multiple rings; as you can see in the example there are (0 1 3) and (2 3) rings, and I see no way to put everything in a single ring because 0 and 3 are not spatially connected. Thanks for the discussion! – Elwardi Mar 13 '22 at 12:32
  • You’re sure you need processes and not threads? – gnasher729 Mar 13 '22 at 12:33
  • @gnasher729 Yep, this will be MPI-ready code; that was imposed by upstream libs – Elwardi Mar 13 '22 at 12:34
  • Sounds like your problem is still deciding what processes should share data and not how you actually share it. – Goswin von Brederlow Mar 13 '22 at 12:51
  • @GoswinvonBrederlow I just tried to implement the multi-rings thing and I realized it's actually ALWAYS possible to use only one ring, simply (2 0 1 3) in the example works – Elwardi Mar 13 '22 at 12:52
  • You worry about scaling this up. So how does it look with 16 or 256 processes? – Goswin von Brederlow Mar 13 '22 at 12:54
  • @GoswinvonBrederlow Kind of, yes, I know how many elements each process has to give up, but I'm not quite sure yet where to off-load those elements; with only one ring, though, things are much simpler, since I can now construct that ring based on neighboring processes. With this, some cells will travel across many processes for now – Elwardi Mar 13 '22 at 12:54
  • Worst case scenario is cells will have to travel the full path still :( – Elwardi Mar 13 '22 at 12:55
  • At this point, a full reconstruction and redistribution (based on target number of elements on each process) seems much better – Elwardi Mar 13 '22 at 12:56

1 Answer


I'll generalize your question: you are looking for a load-balancing algorithm where the processes are connected through a graph and can only move load to graph-connected processes. This algorithm exists: it's known as "diffusion-based load balancing", and it was originally proposed by Cybenko. A simple web search will give you a ton of references.
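
For illustration, a minimal sketch of one iteration of a first-order diffusion scheme of this kind, in C with MPI; the names and the diffusion parameter ALPHA are mine, not taken from Cybenko's paper:

#include <mpi.h>

#define ALPHA 0.25   /* diffusion parameter; must be small enough for stability */

/* neighbors[]: ranks of the spatially adjacent processes.
 * On return, transfer[k] > 0 means "send that much load to neighbors[k]",
 * transfer[k] < 0 means "expect to receive that much from neighbors[k]". */
void diffusion_step(double my_load, const int *neighbors, int n_nbrs,
                    double *transfer, MPI_Comm comm)
{
    for (int k = 0; k < n_nbrs; ++k) {
        double nbr_load;

        /* Exchange scalar loads with this neighbor. */
        MPI_Sendrecv(&my_load, 1, MPI_DOUBLE, neighbors[k], 0,
                     &nbr_load, 1, MPI_DOUBLE, neighbors[k], 0,
                     comm, MPI_STATUS_IGNORE);

        /* A positive difference means this rank is heavier and pushes load. */
        transfer[k] = ALPHA * (my_load - nbr_load);
    }
}

Iterating this step drives the per-process loads toward the global average while every move stays between graph-adjacent processes; in practice the transfers have to be rounded to whole elements and the iteration stopped once the imbalance falls below a tolerance.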

Victor Eijkhout
  • Thanks! That's exactly what I'm looking for. However, in my case I sample the load using CPU time but move vector elements around; I can probably formulate a version of these diffusion-based LB algorithms; this will probably be the accepted answer. – Elwardi Mar 16 '22 at 08:59
  • I've had a similar application, where I was measuring CPU time on each process, from it deduced the local time per unit of data, and then solved a small linear system (of size the number of processors) to decide how to move data. I'd have to dig into old notes to find the details. – Victor Eijkhout Mar 16 '22 at 10:37
  • That's what I'm doing to the letter! Now with the diffusion, I can satisfy the requirement that processes should only send to neighboring ones – Elwardi Mar 16 '22 at 12:43
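
Picking up the measurement idea from the two comments above, here is a rough sketch of one way to turn a measured CPU time into a per-rank target element count, under the (strong) assumption that a rank's cost is proportional to its number of local elements; this is a simple closed-form variant, not the linear-system formulation mentioned in the comment:

#include <mpi.h>

/* elapsed: measured CPU time of the last step on this rank,
 * n_local: current number of local elements.
 * Returns the element count this rank should hold so that
 * cost_i * n_i is the same on every rank. */
long target_count(double elapsed, long n_local, MPI_Comm comm)
{
    long   n_total;
    double cost = elapsed / (double)n_local;   /* local time per element */
    double inv  = 1.0 / cost, inv_sum;

    MPI_Allreduce(&n_local, &n_total, 1, MPI_LONG, MPI_SUM, comm);
    MPI_Allreduce(&inv, &inv_sum, 1, MPI_DOUBLE, MPI_SUM, comm);

    /* Balance when cost_i * n_i is constant:
     * n_i = N_total * (1/cost_i) / sum_j (1/cost_j)                    */
    return (long)((inv / inv_sum) * (double)n_total);
}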