I am trying to distribute my matrix in a block-cyclic fashion. I learned a lot from this question (MPI IO Reading and Writing Block Cyclic Matrix), but it is not quite what I need.
Let me explain my problem.
Suppose I have a 12 x 12 matrix that I want to distribute over a 2 x 3 processor grid, such that the first processor gets the elements 1, 2, 7, 8, 13, 14, 19, 20, 49, 50, 55, 56, 61, 62, 67, 68, 97, 98, 103, 104, 109, 110, 115 and 116:
A =   1   2   3   4   5   6   7   8   9  10  11  12
     13  14  15  16  17  18  19  20  21  22  23  24
     25  26  27  28  29  30  31  32  33  34  35  36
     37  38  39  40  41  42  43  44  45  46  47  48
     49  50  51  52  53  54  55  56  57  58  59  60
     61  62  63  64  65  66  67  68  69  70  71  72
     73  74  75  76  77  78  79  80  81  82  83  84
     85  86  87  88  89  90  91  92  93  94  95  96
     97  98  99 100 101 102 103 104 105 106 107 108
    109 110 111 112 113 114 115 116 117 118 119 120
    121 122 123 124 125 126 127 128 129 130 131 132
    133 134 135 136 137 138 139 140 141 142 143 144
So, basically, I want to partition my matrix into 2 x 2 blocks and then distribute those blocks to the processors (numbered from 1 to 6) in this way:
1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6
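Put differently, the owner of a global element can be computed like this (just a little helper I wrote to describe the mapping; it is not part of my actual code):

/* Which processor (1..6, numbered as in the grid above) owns global
 * element (i, j)? 0-based indices, 2 x 2 blocks on a 2 x 3 process grid.
 * This is only to illustrate the distribution I want.
 */
int owner(int i, int j) {
    int prow = (i / 2) % 2;   /* process-grid row    (P = 2) */
    int pcol = (j / 2) % 3;   /* process-grid column (Q = 3) */
    return prow * 3 + pcol + 1;
}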
I tried to achieve this as in the question linked above, but the problem is that the local array on the first processor is formed column-wise, i.e. it looks like this:
1, 13, 49, 61, 97, 109, 2, 14, 50, 62, 98, 110, 7, 19, 55, 67, 103, 115, 8, 20, 56, 68, 104, 116
This is my C code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

#define N 12
#define P 2
#define Q 3

int main(int argc, char **argv) {
    int rank;
    int size;
    double *A;
    int A_size;

    MPI_Datatype filetype;
    MPI_File fin;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /*
     * Reading from file.
     */
    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = N;                       /* no. of rows in global array    */
    gsizes[1] = N;                       /* no. of columns in global array */

    distribs[0] = MPI_DISTRIBUTE_CYCLIC;
    distribs[1] = MPI_DISTRIBUTE_CYCLIC;

    dargs[0] = 2;                        /* no. of rows in block */
    dargs[1] = 2;                        /* no. of cols in block */

    psizes[0] = P;                       /* no. of processes in vertical
                                            dimension of process grid      */
    psizes[1] = Q;                       /* no. of processes in horizontal
                                            dimension of process grid      */

    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
                           MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "A.txt",
                  MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fin);
    MPI_File_set_view(fin, 0, MPI_DOUBLE, filetype, "native",
                      MPI_INFO_NULL);

    A_size = (N * N) / (P * Q);
    A = (double*) malloc(A_size * sizeof(double));
    MPI_File_read_all(fin, A, A_size,
                      MPI_DOUBLE, &status);
    MPI_File_close(&fin);

    printf("\n======\ni = %d\n", rank);
    printf("A : ");
    for (int i = 0; i < A_size; i++) {
        printf("%lg ", A[i]);
    }

    MPI_Finalize();
    return 0;
}
What I really want is for those 2 x 2 blocks to be stored consecutively, i.e. the local array of the first processor should look like this:
1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ...
I assume that I will need to define another MPI_Datatype (something like a vector or subarray), but I just cannot figure out how to achieve that.
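The only workaround I can think of right now is to repack the buffer by hand after the read; roughly something like this (repack_blocks, lrows, lcols and nb are just my names, and I assume the column-major local layout shown above):

/* Repack the local array from column-major (lrows x lcols, exactly as
 * MPI_File_read_all filled it) into contiguous nb x nb blocks, with the
 * blocks ordered down each block-column. Sketch only.
 */
static void repack_blocks(const double *in, double *out,
                          int lrows, int lcols, int nb) {
    int idx = 0;
    for (int bj = 0; bj < lcols / nb; bj++)        /* local block column  */
        for (int bi = 0; bi < lrows / nb; bi++)    /* local block row     */
            for (int j = 0; j < nb; j++)           /* column inside block */
                for (int i = 0; i < nb; i++)       /* row inside block    */
                    out[idx++] = in[(bj * nb + j) * lrows + (bi * nb + i)];
}

/* e.g. repack_blocks(A, B, N / P, N / Q, 2); with B allocated like A */

But that is an extra copy, which is why I would rather have a datatype do it directly.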
Edit
I think I have partially solved my problem. Basically, each processor ends up with a 4 x 6 matrix in Fortran order, and with MPI_Type_create_subarray(...) I can easily extract a 2 x 2 block.
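The extraction I have in mind looks roughly like this (bi, bj and block_type are my names, and I am assuming the Fortran-order local matrix described above):

int bi = 0, bj = 0;                       /* which 2 x 2 block I want       */
int local_sizes[2] = { 4, 6 };            /* my local matrix, Fortran order */
int sub_sizes[2]   = { 2, 2 };            /* one 2 x 2 block                */
int starts[2]      = { 2 * bi, 2 * bj };  /* where that block starts        */
MPI_Datatype block_type;

MPI_Type_create_subarray(2, local_sizes, sub_sizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &block_type);
MPI_Type_commit(&block_type);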
But now I want each processor to send its block-row to every processor in the same column of the process grid, and its block-column to every processor in the same row. The processors are numbered in the grid as
1 2 3
4 5 6
so, for example, in the first step, processor 1 should send its block-row
 1  2  7  8
13 14 19 20
to processor 4; and its block-column
  1   2
 13  14
 49  50
 61  62
 97  98
109 110
to processors 2 and 3.
I created a Cartesian communicator, and used MPI_Cart_sub() to create row-wise and column-wise communicators as well.
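Roughly what I did (the variable names are mine):

MPI_Comm grid_comm, row_comm, col_comm;
int dims[2]    = { P, Q };        /* 2 x 3 process grid                      */
int periods[2] = { 0, 0 };
int remain[2];

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

remain[0] = 0; remain[1] = 1;     /* keep the column dimension: processes in
                                     the same grid row form row_comm         */
MPI_Cart_sub(grid_comm, remain, &row_comm);

remain[0] = 1; remain[1] = 0;     /* keep the row dimension: processes in
                                     the same grid column form col_comm      */
MPI_Cart_sub(grid_comm, remain, &col_comm);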
I think I should use MPI_Bcast(), but I do not know how to combine MPI_Bcast() with MPI_Type_create_subarray(). It seems I should first copy the extracted subarray into some local_array and then Bcast(local_array). However, MPI_Type_create_subarray() only gives me a "view" of the subarray, not the data itself, so the best solution I have come up with is an Isend/Irecv pair from root to root.
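For reference, the copy-then-broadcast variant I described would look roughly like this (local_array, root and col_comm are my names):

int root = 0;                     /* rank within col_comm that owns the block */
double local_array[4];            /* one extracted 2 x 2 block, contiguous    */

/* ... on the root: copy the 2 x 2 block out of A into local_array ...        */
MPI_Bcast(local_array, 4, MPI_DOUBLE, root, col_comm);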
Is there a more elegant solution?