
Hi everyone. I need to understand how to decompose an array so that its sub-blocks are assigned to a fixed number of processes. The case where the number of elements % the number of processes == 0 is simple; I would like to know an efficient way to do it when the remainder is nonzero. If possible, a code example (in C using MPI) would help me understand this better. Furthermore, I would like to ask which of:

  • blockwise decomposition
  • cyclic decomposition
  • block cyclic decomposition

is the most efficient (assuming that sending and receiving data has a certain cost), and whether there is something even faster for this purpose. Thank you all.

1 Answer


The simplest solution is to give every process N/P points, rounded down, and to give the last process the excess as well. That is also a bad solution: the load is unbalanced, so all processes will be waiting for the last one.

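Next best: every process gets (N+P-1)/P points, which in integer arithmetic is N/P rounded up. Now the last process gets a smaller number of points (and if N is small relative to P, more than one trailing process can come up short). That's a lot better: instead of everyone waiting on one overloaded process, only the tail end has some idle time.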
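To make the difference between these first two schemes concrete, here is a minimal sketch of the per-process counts each one produces. The function names `count_floor_last` and `count_ceil` are made up for this illustration; they are not from any library.

```c
/* Scheme 1: floor(N/P) points each; the last rank also takes the remainder. */
int count_floor_last(int n, int nprocs, int rank) {
    int base = n / nprocs;
    return (rank == nprocs - 1) ? base + n % nprocs : base;
}

/* Scheme 2: ceil(N/P) points each, until the points run out. */
int count_ceil(int n, int nprocs, int rank) {
    int chunk = (n + nprocs - 1) / nprocs;   /* N/P rounded up */
    int begin = rank * chunk;
    if (begin >= n) return 0;                /* trailing ranks may get nothing */
    int end = begin + chunk;
    if (end > n) end = n;                    /* clip the last non-empty chunk */
    return end - begin;
}
```

For N = 9 and P = 5 the first scheme gives 1 1 1 1 5, while the second gives 2 2 2 2 1. For N = 6 and P = 5 the second scheme gives 2 2 2 0 0, so it can leave several processes with nothing to do.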

Best solution I know is to assign each process the range defined as follows:

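// beginend has nprocs+1 entries; process p owns indices beginend[p] up to, but not including, beginend[p+1]
for (int p=0; p<=nprocs; p++)
  beginend[p] = p*npoints/nprocs;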

Code it and try it out; you'll see that there is at most a one point spread between the largest and smallest number of points-per-process, and also the excess points are nicely spread out. Sample output (N/P, followed by the number of points on each of the P = 5 processes):

1/5: 0 0 0 0 1
2/5: 0 0 1 0 1
3/5: 0 1 0 1 1
4/5: 0 1 1 1 1
5/5: 1 1 1 1 1
6/5: 1 1 1 1 2
7/5: 1 1 2 1 2
8/5: 1 2 1 2 2
9/5: 1 2 2 2 2
10/5: 2 2 2 2 2
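The table above can be reproduced with a small helper. This is only a sketch: `block_begin` and `block_counts` are hypothetical names, and the widening to `long long` is my own precaution against overflow of rank*npoints for large arrays, not something the formula itself requires.

```c
/* Start index of a rank's block. Passing rank == nprocs returns npoints,
   i.e. one past the end of the last block. */
int block_begin(int npoints, int nprocs, int rank) {
    /* widen before multiplying so rank*npoints cannot overflow int */
    return (int)((long long)rank * npoints / nprocs);
}

/* Fill counts[p] and displs[p] for p = 0 .. nprocs-1.
   Both arrays must have room for nprocs entries. */
void block_counts(int npoints, int nprocs, int *counts, int *displs) {
    for (int p = 0; p < nprocs; p++) {
        displs[p] = block_begin(npoints, nprocs, p);
        counts[p] = block_begin(npoints, nprocs, p + 1) - displs[p];
    }
}
```

Since the question asks about MPI: the two arrays have the layout that `MPI_Scatterv` takes as its `sendcounts` and `displs` arguments on the root, so one plausible use is to call `block_counts` on every rank and pass `counts[rank]` as the receive count.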

So that's the blockwise solution. Doing it cyclically is possible too, but often that's not as good from the point of view of cache use. The cyclic distribution is used, for instance, in an LU factorization, where the first so-many rows/columns gradually become inactive, so a blockwise distribution would leave the first processes with nothing to do.

Block cyclic is more complicated, but it is a good combination of the advantages of block and cyclic.
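For comparison, the cyclic and block cyclic distributions can be described by a function that maps a global index to its owner. The sketch below uses hypothetical names (`cyclic_owner`, `block_cyclic_owner`, `block_cyclic_local`) and assumes a one-dimensional array with block size b.

```c
/* Cyclic: element i goes to rank i mod P. */
int cyclic_owner(int i, int nprocs) {
    return i % nprocs;
}

/* Block cyclic: blocks of b consecutive elements are dealt out round-robin. */
int block_cyclic_owner(int i, int b, int nprocs) {
    return (i / b) % nprocs;
}

/* Position of global element i within its owner's local storage. */
int block_cyclic_local(int i, int b, int nprocs) {
    int round = i / (b * nprocs);   /* how many full rounds came before i */
    return round * b + i % b;       /* b elements per round, plus offset in block */
}
```

With b = 2 and P = 3, rank 0 holds global elements 0, 1, 6, 7, and so on, so element 7 is at local index 3 on rank 0. Setting b = 1 gives the plain cyclic distribution, and a b large enough to cover the array in a single round gives a blockwise one.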

Victor Eijkhout