
I need MPI C code that writes data to a binary file via MPI I/O. Process 0 has to write a short header, then all processes have to write their own pieces of the array indicated by that header. Then process 0 writes another header, followed by all processes writing their pieces of the next array, and so on. I came up with the following test code, which actually does what I want. No one will be more surprised about that than me.

My question is, since I am new to MPI I/O: am I "getting it"? Am I doing this the "right way", or is there some more efficient or compact way to do it?

The code is below. (BTW, if you want to test this, run it with exactly 4 procs; it assumes a 2x2 process grid.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "mpi.h"

#define ROWS 9
#define COLS 10

int main(int argc, char *argv[]) {

   int size_mpi, rank_mpi, row_mpi, col_mpi;
   int i,j,p,ttlcols;
   int sizes[]= {2*ROWS,2*COLS};
   int subsizes[]= {ROWS,COLS};
   int starts[] = {0,0};
   int vals[ROWS][COLS];
   char hdr[] = "This is just a header.\n";
   MPI_Status stat_mpi;
   MPI_Datatype subarray;
   MPI_File fh;
   MPI_Offset offset, end_of_hdr;
   MPI_Info info_mpi;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD,&size_mpi);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank_mpi);

   ttlcols = 2*COLS;
   /* Where are we in the array of processes? */
   col_mpi = rank_mpi%2;
   row_mpi = rank_mpi/2;
   /* Populate the array */
   for (j=0; j<ROWS; j++){
      for (i=0; i<COLS; i++){
         vals[j][i] = ttlcols*(ROWS*row_mpi + j) +
                      COLS*col_mpi + i;
      }
   } 
   /* MPI derived datatype for setting a file view */    
   starts[0] = row_mpi*ROWS;
   starts[1] = col_mpi*COLS;
   MPI_Type_create_subarray(2, sizes, subsizes, starts,
                            MPI_ORDER_C, MPI_INT,
                            &subarray); 
   MPI_Type_commit(&subarray);
   /* open the file */    
   printf("opening file\n");
   MPI_File_open(MPI_COMM_WORLD, "arrdata.dat", 
                 MPI_MODE_WRONLY | MPI_MODE_CREATE,
                 MPI_INFO_NULL, &fh);
   printf("opened file\n");
   /* set the initial file view */    
   MPI_File_set_view(fh, 0, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL);
   /* proc 0 writes first header */    
   if (rank_mpi == 0) {
      MPI_File_write(fh, (void*)hdr, strlen(hdr), MPI_CHAR, &stat_mpi);
      MPI_File_get_position(fh, &offset);
      MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   }
   /* everybody has to know where proc 0 stopped writing */    
   MPI_Bcast((void*)&end_of_hdr, 1, MPI_INT, 0, MPI_COMM_WORLD);
   /* re-set file view for writing first array */    
   MPI_File_set_view(fh, end_of_hdr, MPI_INT,
                     subarray, "native",
                     MPI_INFO_NULL);
   /* and write the array */    
   MPI_File_write(fh, (void*)vals, ROWS*COLS, MPI_INT,
                  &stat_mpi);

   /* now go through the whole thing again to test */
   MPI_File_get_position(fh, &offset);
   MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   MPI_File_set_view(fh, end_of_hdr, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL);
   if (rank_mpi == 0) {
      MPI_File_write(fh, (void*)hdr, strlen(hdr), MPI_CHAR, &stat_mpi);
      MPI_File_get_position(fh, &offset);
      MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   }

   MPI_Bcast((void*)&end_of_hdr, 1, MPI_INT, 0, MPI_COMM_WORLD);

   MPI_File_set_view(fh, end_of_hdr, MPI_INT,
                     subarray, "native",
                     MPI_INFO_NULL);
   MPI_File_write(fh, (void*)vals, ROWS*COLS, MPI_INT,
                  &stat_mpi);
   MPI_File_close(&fh);

   MPI_Finalize();

   return 0;

}
bob.sacamento
  • Do you always know the header sizes upfront, and are they all identical (the sizes)? If so, you can create a view for process #0 that include the header and data, and another for the other processes that only includes the data. Then you'd have only one call to make to `MPI_File_set_view()` and 2 calls to `MPI_File_write()` per iteration for rank #0 and one for the other ranks. – Gilles Jun 15 '16 at 15:54
  • @Gilles I know the header sizes upfront. They are not all identical, unfortunately. I can't change that. Not my design. – bob.sacamento Jun 15 '16 at 16:27
  • 1
    There is a bug in your code: `end_of_hdr` is of type `MPI_Offset` and you are broadcasting it using `MPI_INT`. `MPI_Offset` is typically 64-bit while `MPI_INT` corresponds to `int`, which on LP64 Unix systems (*BSD, Linux, Solaris) is only 32-bit. On a big endian system, a very wrong offset will get broadcasted. Use `MPI_OFFSET` (if supported by the MPI implementation) or the corresponding C type (check your `mpi.h`) – Hristo Iliev Jun 16 '16 at 09:14
  • @HristoIliev Thank you. I was wondering about that. Couldn't find anything about "MPI_OFFSET" in the standard so I tried MPI_INT. It worked, so I figured that must be the ticket. But I guess not. Will look into it. Thanks again. – bob.sacamento Jun 16 '16 at 14:12
  • Strange, `MPI_AINT` and `MPI_OFFSET` are described together with all the other predefined MPI datatypes in section 3.2.2 (Message Data) of the standard at least since version 2.2. – Hristo Iliev Jun 16 '16 at 14:32
  • It didn't occur to me that MPI_OFFSET would be its own datatype. My "go to" for this kind of thing is table 3.2, which doesn't mention it. I was assuming that table was exhaustive. – bob.sacamento Jun 16 '16 at 15:11
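Following up on the comment thread above, a minimal sketch of the corrected broadcast, assuming the MPI implementation provides the predefined `MPI_OFFSET` datatype (described in the standard since version 2.2); the variable names are the ones from the test code:

   /* end_of_hdr is an MPI_Offset, so broadcast it with the matching
      predefined datatype; MPI_INT would truncate it on LP64 systems. */
   MPI_Bcast(&end_of_hdr, 1, MPI_OFFSET, 0, MPI_COMM_WORLD);

If `MPI_OFFSET` is not available, broadcasting `sizeof(MPI_Offset)` bytes as `MPI_BYTE` is a workable fallback on homogeneous systems.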

1 Answer


Your approach is fine and if you need something right now to put bits in a file, go ahead and call yourself done.

Here are some suggestions for more efficiency:

  • You can consult the status object for how many bytes were written, instead of getting the position and translating into bytes (the sketch after this list shows one way).

  • If you have the memory to hold all the data before you write, you could describe your I/O with an MPI datatype (admittedly, one that might end up being a pain to create). Then all processes would issue a single collective call.

  • You should use collective I/O instead of independent I/O. A "quality library" should be able to give you equal if not better performance (and if not, you could raise the issue with your MPI implementation).

  • If the processes have different amounts of data to write, MPI_EXSCAN is a good way to collect who has what data. Then you can call MPI_FILE_WRITE_AT_ALL at the correct offset in the file; see the sketch after this list.
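A minimal sketch of the first and last suggestions combined (a hypothetical helper with made-up names, not code from this answer), assuming the file still has its default byte view, so explicit offsets are plain byte positions:

   /* Hypothetical helper: each rank writes 'mycount' ints from 'mybuf',
      packed rank after rank starting at byte offset 'data_start'.
      Returns the number of ints this rank actually wrote. */
   static int write_ragged_block(MPI_File fh, MPI_Offset data_start,
                                 const int *mybuf, int mycount, MPI_Comm comm)
   {
      int rank, written;
      MPI_Offset mybytes = (MPI_Offset)mycount * sizeof(int);
      MPI_Offset myoff = 0;
      MPI_Status st;

      MPI_Comm_rank(comm, &rank);

      /* Exclusive prefix sum: rank r receives the total byte count of ranks 0..r-1. */
      MPI_Exscan(&mybytes, &myoff, 1, MPI_OFFSET, MPI_SUM, comm);
      if (rank == 0) myoff = 0;   /* MPI_Exscan leaves rank 0's result undefined */

      /* One collective write; each rank supplies its own explicit byte offset. */
      MPI_File_write_at_all(fh, data_start + myoff, mybuf, mycount, MPI_INT, &st);

      /* How much was actually written comes straight from the status,
         with no MPI_File_get_position()/MPI_File_get_byte_offset() round trip. */
      MPI_Get_count(&st, MPI_INT, &written);
      return written;
   }

Each rank only needs to know its own count; `MPI_Exscan` turns the counts into starting offsets, and the single collective call lets the MPI library aggregate the ragged pieces into a small number of large requests.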

Rob Latham
  • Thanks! Re: your third bullet, I want only one proc to write the headers. I wasn't able to figure out how to do that w/o isolating those calls to MPI_File_write. Is there a way for everyone to call MPI_File_write, but only one proc do the actual writing? Thanks again! – bob.sacamento Jun 16 '16 at 19:35
  • 1
    absolutely. It's fair to have N processes call a collective operation, but only one of them has data. any process can pass in a 0 for the 'count' parameter. However, I was suggesting a single collective where rank 0 writes the header along with the array, while everyone else writes their portion of the array. – Rob Latham Jun 16 '16 at 19:37
  • understood. Thanks once more. – bob.sacamento Jun 16 '16 at 19:42
  • You should use "MPI_File_write_all" when you write the array - I have seen orders of magnitude performance improvements for large data sets compared to MPI_File_write. Telling the IO library you are doing a collective write (_all) enables it to aggregate data from different processes before writing, e.g. leading to a small number of large IO transactions as opposed to a large number of small transactions, which can significantly improve performance. – David Henty Jun 17 '16 at 09:42