
P1 P2
P3 P4

1 2 3 4
5 6 7 8
1 2 3 4
0 6 0 8

Suppose P1, P2, P3, P4 are processes and P1 has data points 1 2 5 6, P2 has data points 3 4 7 8, P3 has data points 1 2 0 6, and P4 has data points 3 4 0 8. I want to perform a stencil computation on this data such that the new value of 6 will be the average of (2, 5, 7, 2). However, 7 is a data point of P2 and 2 is a data point of P3. How do I solve this? I am able to run this for a single process, but how do I approach it with plain MPI_Send and MPI_Recv? Also, how would I do it using MPI_Type_contiguous or MPI_Type_vector and collective communication?

Any help will be much appreciated. Thank you.

  • The usual solution is to surround each sub-array with one or more layers of cells, called "halo" or "ghost cells", and sync their content at each iteration. This is called a "halo swap" and is most easily implemented using `MPI_Sendrecv`. See [here](https://stackoverflow.com/a/17591795/1374437). – Hristo Iliev Feb 19 '21 at 08:17

1 Answer


There is a well-known solution to this problem that goes under the name of ghost cells or halos. The idea is to surround each sub-array with one or more additional layers of cells, depending on the stencil. At the beginning of each iteration, each process syncs the state of its halo by exchanging data with its nearest neighbours, an operation called a halo swap. The halos provide the data needed to compute the new state of all the inner cells, but they themselves lack the data needed for their own update, so their content becomes "old" after one iteration, which is why they are sometimes called "ghosts".

This [blog post](http://www.claudiobellei.com/2018/09/30/julia-mpi/) presents the idea clearly, although the code in it is written in Julia. Since its About page says it is fine to reuse the content elsewhere, I've shamelessly borrowed the following illustration:

Ghost cells. Taken from http://www.claudiobellei.com/2018/09/30/julia-mpi/

The figure shows halos only along the borders between the processes, but in practice it is easier to keep halos on all four sides of each sub-array and only update the ones that are actually needed: it makes the code more symmetric and reduces its complexity.
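
As a concrete sketch (the array name grid, the element type double, and the LOCAL_ROWS/LOCAL_COLS constants for the 2x2 blocks are my assumptions, not something prescribed by the original post), each process can simply allocate its block padded with one extra row and column on every side:

    /* Hypothetical layout: each process owns a 2x2 block of the 4x4 grid and
       stores it in a row-major C array padded with a one-cell halo on each side. */
    #define LOCAL_ROWS 2
    #define LOCAL_COLS 2

    double grid[LOCAL_ROWS + 2][LOCAL_COLS + 2];
    /* Interior cells live at grid[1..LOCAL_ROWS][1..LOCAL_COLS]; row 0,
       row LOCAL_ROWS+1, column 0 and column LOCAL_COLS+1 are the halos. */

With this layout the stencil loop can use the same indexing for every interior cell, regardless of whether a neighbour is owned locally or sits in a halo.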

The decomposition in your particular case goes like this:

1 2 3 4
5 6 7 8
1 2 3 4
0 6 0 8

first becomes

1 2 | 3 4
5 6 | 7 8
----+----
1 2 | 3 4
0 6 | 0 8

Add halos:

x x x x | x x x x
x 1 2 x | x 3 4 x
x 5 6 x | x 7 8 x
x x x x | x x x x
--------+--------
x x x x | x x x x
x 1 2 x | x 3 4 x
x 0 6 x | x 0 8 x
x x x x | x x x x

Now, before the stencil operation, you need to perform the halo swap. It goes in steps as shown below. Swaps are performed in pairs along each dimension. The order doesn't really matter.

1.1) Horizontal swap in the direction east

Each process sends its rightmost column to the process right of it. The receiver places the data in its left halo column:

x x [x] x | [x] x x x     x x [x] x | [x] x x x
x 1 [2] x | [x] 3 4 x     x 1 [2] x | [2] 3 4 x
x 5 [6] x | [x] 7 8 x     x 5 [6] x | [6] 7 8 x
x x [x] x | [x] x x x     x x [x] x | [x] x x x
----------+---------- --> ----------+----------
x x [x] x | [x] x x x     x x [x] x | [x] x x x
x 1 [2] x | [x] 3 4 x     x 1 [2] x | [2] 3 4 x
x 0 [6] x | [x] 0 8 x     x 0 [6] x | [6] 0 8 x
x x [x] x | [x] x x x     x x [x] x | [x] x x x

1.2) Horizontal swap in the direction west

Each process sends its leftmost column to the process left of it. The receiver places the data in its right halo column:

x x x [x] | x [x] x x     x x x [x] | x [x] x x
x 1 2 [x] | 2 [3] 4 x     x 1 2 [3] | 2 [3] 4 x
x 5 6 [x] | 6 [7] 8 x     x 5 6 [7] | 6 [7] 8 x
x x x [x] | x [x] x x     x x x [x] | x [x] x x
----------+---------- --> ----------+----------
x x x [x] | x [x] x x     x x x [x] | x [x] x x
x 1 2 [x] | 2 [3] 4 x     x 1 2 [3] | 2 [3] 4 x
x 0 6 [x] | 6 [0] 8 x     x 0 6 [0] | 6 [0] 8 x
x x x [x] | x [x] x x     x x x [x] | x [x] x x

2.1) Vertical swap in the direction south

Each process sends its bottom row to the process below it. The receiver places the data in its top halo row:

[x x x x | x x x x]   [x x x x | x x x x]
 x 1 2 3 | 2 3 4 x     x 1 2 3 | 2 3 4 x
[x 5 6 7 | 6 7 8 x]   [x 5 6 7 | 6 7 8 x]
 x x x x | x x x x     x x x x | x x x x
 --------+-------- --> --------+--------
[x x x x | x x x x]   [x 5 6 7 | 6 7 8 x]
 x 1 2 3 | 2 3 4 x     x 1 2 3 | 2 3 4 x
[x 0 6 0 | 6 0 8 x]   [x 0 6 0 | 6 0 8 x]
 x x x x | x x x x     x x x x | x x x x

2.2) Vertical swap in the direction north

Each process sends its top row to the process above it. The receiver places the data in its bottom halo row:

 x x x x | x x x x     x x x x | x x x x
[x 1 2 3 | 2 3 4 x]   [x 1 2 3 | 2 3 4 x]
 x 5 6 7 | 6 7 8 x     x 5 6 7 | 6 7 8 x
[x x x x | x x x x]   [x 1 2 3 | 2 3 4 x]
 --------+-------- --> --------+--------
 x 5 6 7 | 6 7 8 x     x 5 6 7 | 6 7 8 x
[x 1 2 3 | 2 3 4 x]   [x 1 2 3 | 2 3 4 x]
 x 0 6 0 | 6 0 8 x     x 0 6 0 | 6 0 8 x
[x x x x | x x x x]   [x x x x | x x x x]

Each of those operations can be implemented with a single MPI call - MPI_Sendrecv. The send part sends the local boundary row or column, and the receive part places the neighbour's data into the local halo row or column.

The final result after the four swaps is:

x x x x | x x x x
x 1 2 3 | 2 3 4 x
x 5 6 7 | 6 7 8 x
x 1 2 3 | 2 3 4 x
--------+--------
x 5 6 7 | 6 7 8 x
x 1 2 3 | 2 3 4 x
x 0 6 0 | 6 0 8 x
x x x x | x x x x

You'll probably notice that even the corner elements in each halo contain the right diagonal element that one would expect to be there. The beauty of this is that you can simply add more steps to extend it to as many dimensions as you need and all elements automagically find their right position in the halo region.

You now have all four neighbours of 6 available locally and can proceed with averaging their values. Note that you only use the values in the halos and do not update them.
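
For illustration, here is a minimal sketch of that update in C, assuming the padded layout sketched earlier (the names grid, new_grid and the LOCAL_ROWS/LOCAL_COLS constants are hypothetical):

    /* Hypothetical sketch: 4-point stencil average over the interior cells only.
       The halo cells (row 0, row LOCAL_ROWS+1, column 0, column LOCAL_COLS+1)
       are read here but never written. */
    for (int i = 1; i <= LOCAL_ROWS; i++)
        for (int j = 1; j <= LOCAL_COLS; j++)
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);

For the 6 in P1, the loop reads 2 from above, 5 from the left, 7 from the right halo, and 2 from the bottom halo, i.e., exactly the average of (2, 5, 7, 2) asked about in the question.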

Some additional notes:

  1. This works whether or not you have periodic boundary conditions. With periodic boundary conditions, P2 is the right neighbour of P1 and P1 is the right neighbour of P2 when doing the eastward shift; likewise, P1 is the left neighbour of P2 and P2 is the left neighbour of P1 for the westward shift. The same applies in the vertical direction.

  2. You do not need special code to handle processes that are on the boundary. If a process doesn't have a right (or left, or top, or bottom) neighbour, simply send or receive a message to/from MPI_PROC_NULL. I.e., the code goes like this:

    int right = compute_right_rank();
    int left = compute_left_rank();
    int up = compute_top_rank();
    int down = compute_bottom_rank();

    /* 1.1) east swap: send the rightmost column, receive into the left halo */
    MPI_Sendrecv(right_column, 1, columndt, right, 0,
                 left_halo, 1, columndt, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 1.2) west swap: send the leftmost column, receive into the right halo */
    MPI_Sendrecv(left_column, 1, columndt, left, 0,
                 right_halo, 1, columndt, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 2.1) south swap: send the bottom row, receive into the top halo */
    MPI_Sendrecv(bottom_row, 1, rowdt, down, 0,
                 top_halo, 1, rowdt, up, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 2.2) north swap: send the top row, receive into the bottom halo */
    MPI_Sendrecv(top_row, 1, rowdt, up, 0,
                 bottom_halo, 1, rowdt, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    

    Here, compute_right_rank() should return rank + 1 if there is a rank to the right or MPI_PROC_NULL otherwise. Sending to MPI_PROC_NULL or receiving from it is a no-op, i.e., nothing happens. It allows you to write code without ifs.

    columndt is an MPI datatype that corresponds to a column in the array; you can construct it using MPI_Type_vector. rowdt is an MPI datatype that represents a whole row in the array; construct it using MPI_Type_contiguous. A sketch of both constructions, together with the Cartesian-communicator setup from the next note, is given after these notes.

  3. It is super easy to compute the ranks of the neighbours if the processes are organised in a Cartesian communicator. MPI_Cart_create and MPI_Cart_shift are your best friends here. You may also use the MPI neighbourhood collectives to reduce the number of MPI calls even further.

  4. The bottom halos of the bottom ranks are not filled because the boundary conditions are not periodic. The same is the case for the right halos of the rightmost ranks, the top halos of the top ranks, and the left halos of the leftmost ranks. You may want to pre-fill them with a special value, e.g., 0. That value will never change because there is no communication that puts something there.

  5. If your computation is iterative, you must perform the halo swap before each iteration. If the halo swap is too slow compared to the computation, you may increase the thickness of the halo and make it two or more layers thick, using the outer layers to update the values in the inner layers. A halo three layers thick only requires a swap every three iterations.
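
To make notes 2 and 3 concrete, here is a minimal sketch (assumptions on my part: the 2x2 process grid, non-periodic boundaries, double elements, and the LOCAL_ROWS/LOCAL_COLS constants from the earlier sketches) of how the neighbour ranks and the two datatypes used in the code above could be set up:

    /* Hypothetical setup; assumes MPI_Init has been called and 4 processes are used. */
    int dims[2] = {2, 2};     /* 2x2 process grid: P1 P2 / P3 P4           */
    int periods[2] = {0, 0};  /* set to 1 for periodic boundary conditions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    /* Neighbour ranks; MPI_PROC_NULL is returned automatically at the borders. */
    int up, down, left, right;
    MPI_Cart_shift(cart, 0, 1, &up, &down);     /* vertical neighbours   */
    MPI_Cart_shift(cart, 1, 1, &left, &right);  /* horizontal neighbours */

    /* Datatypes describing one full line of the padded, row-major
       (LOCAL_ROWS+2) x (LOCAL_COLS+2) array: a row is contiguous, a column is
       strided. Exchanging full lines, halo cells included, is what lets the
       second pair of swaps fill the corner elements, as explained above. */
    MPI_Datatype rowdt, columndt;
    MPI_Type_contiguous(LOCAL_COLS + 2, MPI_DOUBLE, &rowdt);
    MPI_Type_vector(LOCAL_ROWS + 2, 1, LOCAL_COLS + 2, MPI_DOUBLE, &columndt);
    MPI_Type_commit(&rowdt);
    MPI_Type_commit(&columndt);

The Sendrecv calls from note 2 can then use cart instead of MPI_COMM_WORLD and take the ranks returned by MPI_Cart_shift directly in place of the compute_*_rank() helpers.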

Hristo Iliev