
I am trying to distribute my matrix in block cyclic fashion. I learned a lot from this question (MPI IO Reading and Writing Block Cyclic Matrix), but that is not what I really need.

Let me explain my problem.

Suppose I have this matrix of dimension 12 x 12 which I want to distribute over a processor grid of dimension 2 x 3, such that the first processor gets the elements 1, 2, 7, 8, 13, 14, 19, 20, and so on:

A =

     1     2     3     4     5     6     7     8     9    10    11    12
    13    14    15    16    17    18    19    20    21    22    23    24
    25    26    27    28    29    30    31    32    33    34    35    36
    37    38    39    40    41    42    43    44    45    46    47    48
    49    50    51    52    53    54    55    56    57    58    59    60
    61    62    63    64    65    66    67    68    69    70    71    72
    73    74    75    76    77    78    79    80    81    82    83    84
    85    86    87    88    89    90    91    92    93    94    95    96
    97    98    99   100   101   102   103   104   105   106   107   108
   109   110   111   112   113   114   115   116   117   118   119   120
   121   122   123   124   125   126   127   128   129   130   131   132
   133   134   135   136   137   138   139   140   141   142   143   144

So, basically, I want to partition my matrix into blocks of dimension 2 x 2 and then distribute those blocks to the processors (numbered from 1 to 6) in this way:

1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6

I tried to achieve that as in the question linked above, but the problem is that the local array for the first processor is formed column-wise, i.e. it looks like this:

1, 13, 49, 61, 97, 109, 2, 14, 50, 62, 98, 110, 7, 19, 55, 67, 103, 115, 8, 20, 56, 68, 104, 116

This is my C code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

#define     N           12
#define     P           2
#define     Q           3

int main(int argc, char **argv) {
    int rank;
    int size;

    double *A;
    int A_size;
    
    MPI_Datatype filetype;
    MPI_File fin;

    MPI_Status status;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /**
     * Reading from file.
     */
    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = N; /* no. of rows in global array */
    gsizes[1] = N; /* no. of columns in global array*/

    distribs[0] = MPI_DISTRIBUTE_CYCLIC;
    distribs[1] = MPI_DISTRIBUTE_CYCLIC;

    dargs[0] = 2; // no of rows in block
    dargs[1] = 2; // no of cols in block

    psizes[0] = P; /* no. of processes in vertical dimension
     of process grid */
    psizes[1] = Q; /* no. of processes in horizontal dimension
     of process grid */

    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
            MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "A.txt",
            MPI_MODE_RDONLY,
            MPI_INFO_NULL, &fin);

    MPI_File_set_view(fin, 0, MPI_DOUBLE, filetype, "native",
            MPI_INFO_NULL);

    A_size = (N * N) / (P * Q);
    A = (double*) malloc(A_size * sizeof(double));
    MPI_File_read_all(fin, A, A_size,
            MPI_DOUBLE, &status);

    MPI_File_close(&fin);

    printf("\n======\ni = %d\n", rank);
    printf("A : ");
    for (int i = 0; i < A_size; i++) {
        printf("%lg ", A[i]);
    }

    MPI_Finalize();
    return 0;
}

What I really want is for those 2 x 2 blocks to be stored consecutively, i.e. the local array of the first processor should look like this:

1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ... 

I assume that I will need to define another MPI_Datatype (like a vector or subarray), but I just cannot figure out how to achieve that.
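
For comparison, a plain repacking loop after the read does produce the layout I am after, but I was hoping to express it with an MPI datatype instead of an explicit copy. A sketch (assuming A is the 6 x 4 column-major local array that the code above produces on the first processor; packed, br and bc are names of my own choosing):

double *packed = (double*) malloc(A_size * sizeof(double));
int k = 0;
for (int bc = 0; bc < 2; bc++)              /* local block-column            */
    for (int br = 0; br < 3; br++)          /* local block-row               */
        for (int c = 0; c < 2; c++)         /* column inside the 2 x 2 block */
            for (int r = 0; r < 2; r++)     /* row inside the 2 x 2 block    */
                packed[k++] = A[(2 * br + r) + (2 * bc + c) * 6];  /* 6 = N / P local rows */
/* on the first processor: 1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ... */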

Edit

I think I have partially solved my problem. Basically, each processor ends up with a 6 x 4 local matrix stored in FORTRAN (column-major) order, and with MPI_Type_create_subarray(...) I can easily extract a 2 x 2 block, something like the sketch below.
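
This is what I mean (a sketch, assuming the local array is 6 rows by 4 columns in column-major order; br and bc are the local block-row and block-column indices of the block I want, names of my own choosing):

int sizes[2]    = {6, 4};               /* whole local array: rows, columns */
int subsizes[2] = {2, 2};               /* one 2 x 2 block                  */
int starts[2]   = {2 * br, 2 * bc};     /* upper-left corner of that block  */

MPI_Datatype block_t;
MPI_Type_create_subarray(2, sizes, subsizes, starts,
        MPI_ORDER_FORTRAN, MPI_DOUBLE, &block_t);
MPI_Type_commit(&block_t);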

But now I want each processor to send its block-row to every processor in the same column, and its block-column to every processor in the same row. The processors are numbered in the grid

1 2 3
4 5 6

so, for example, in the first step, processor 1 should send its block-row

1  2  7  8
13 14 19 20

to processor 4, and its block-column

1   2
13  14
49  50
61  62
97  98
109 110

to processors 2 and 3.

I created a Cartesian communicator and used MPI_Cart_sub() to create row-wise and column-wise communicators, too.
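
Roughly like this (a sketch; grid_comm, row_comm and col_comm are names of my own choosing):

MPI_Comm grid_comm, row_comm, col_comm;
int dims[2]    = {P, Q};
int periods[2] = {0, 0};
int remain[2];

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

remain[0] = 0; remain[1] = 1;   /* keep dimension 1 -> processes in the same grid row    */
MPI_Cart_sub(grid_comm, remain, &row_comm);

remain[0] = 1; remain[1] = 0;   /* keep dimension 0 -> processes in the same grid column */
MPI_Cart_sub(grid_comm, remain, &col_comm);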

I think I should use MPI_Bcast(), but I do not know how to combine MPI_Bcast() with MPI_Type_create_subarray(). I would first have to copy the extracted subarray into some local_array and then call Bcast(local_array). However, MPI_Type_create_subarray() only gives me a "view" of the subarray, not the data itself, so the best solution I came up with is an Isend/Irecv from the root to itself, sketched below.
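
To make that concrete, this is the kind of thing I am doing now (a sketch, assuming block_t is the 2 x 2 subarray type from above and row_comm is the row communicator; block and row_root are names I made up):

double block[4];                    /* contiguous copy of one 2 x 2 block */
int row_root = 0;                   /* whichever rank owns the block      */
int row_rank;
MPI_Comm_rank(row_comm, &row_rank);

if (row_rank == row_root) {
    /* send with the strided subarray view, receive into the packed buffer */
    MPI_Request reqs[2];
    MPI_Isend(A, 1, block_t, row_rank, 0, row_comm, &reqs[0]);
    MPI_Irecv(block, 4, MPI_DOUBLE, row_rank, 0, row_comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
MPI_Bcast(block, 4, MPI_DOUBLE, row_root, row_comm);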

Is there a more elegant solution?

Your problem may have different solutions, but it really depends on your constraints (memory, performance). I would consider having 2 matrices, one organized as matrix[row][column] and one as matrix[column][row]. – Martin Aug 22 '22 at 14:18

1 Answer


It seems that your implementation is correct so far. The issue is with how you are printing the local array for the first processor. The local array is formed column-wise because of how the file is read and how the data is distributed among the processes.

To print the local array for the first processor in row-wise fashion, you can modify your code as follows:

if (rank == 0) {
    printf("\nLocal Array for Processor 0\n");
    for (int i = 0; i < P * Q; i++) {                  /* one group of dargs[0]*dargs[1] values */
        for (int j = 0; j < dargs[0] * dargs[1]; j++) {
            printf("%.0f ", A[i * dargs[0] * dargs[1] + j]);
            if ((j + 1) % dargs[1] == 0)               /* line break after each block row */
                printf("\n");
        }
        printf("\n");
    }
}

This will print the local array for the first processor in row-wise fashion as shown below:

Local Array for Processor 0
1  2  3  4 
13 14 15 16

49 50 51 52
61 62 63 64

97 98 99 100
109 110 111 112

2  3  4  5 
14 15 16 17

50 51 52 53
62 63 64 65

103 104 105 106
115 116 117 118

Note that the output shows only the local array for the first processor. You will need to modify the loop to print the local array for each processor in a similar way.
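
If you want to see all ranks, one common way is to print one rank at a time (a sketch, not part of the original code; MPI does not strictly guarantee ordered stdout, but this usually works in practice):

for (int p = 0; p < size; p++) {
    if (rank == p) {
        printf("Local array on rank %d:\n", rank);
        for (int i = 0; i < A_size; i++)
            printf("%.0f%c", A[i], (i + 1) % 4 == 0 ? '\n' : ' ');  /* 4 values per line, just for readability */
        fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* crude ordering of the output */
}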
