
I need MPI C code that writes data to a binary file via MPI I/O. Process 0 has to write a short header, then all processes have to write their own pieces of the array indicated by that header. Then process 0 writes another header, followed by all processes writing their pieces of the next array, and so on. I came up with the following test code, which actually does what I want. No one will be more surprised about that than me.

My question is, since I am new to MPI I/O: am I "getting it"? Am I doing this the "right way", or is there some more efficient or compact way to do it?

The code is below. (BTW, if you want to test this, run it with exactly 4 procs; it assumes a 2x2 process grid.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "mpi.h"

#define ROWS 9
#define COLS 10

int main(int argc, char *argv[]) {

   int size_mpi, rank_mpi, row_mpi, col_mpi;
   int i,j,p,ttlcols;
   int sizes[]= {2*ROWS,2*COLS};
   int subsizes[]= {ROWS,COLS};
   int starts[] = {0,0};
   int vals[ROWS][COLS];
   char hdr[] = "This is just a header.\n";
   MPI_Status stat_mpi;
   MPI_Datatype subarray;
   MPI_File fh;
   MPI_Offset offset, end_of_hdr;
   MPI_Info info_mpi;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD,&size_mpi);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank_mpi);

   ttlcols = 2*COLS;
   /* Where are we in the array of processes? */
   col_mpi = rank_mpi%2;
   row_mpi = rank_mpi/2;
   /* Populate the array */
   for (j=0; j<ROWS; j++){
      for (i=0; i<COLS; i++){
         vals[j][i] = ttlcols*(ROWS*row_mpi + j) +
                      COLS*col_mpi + i;
      }
   } 
   /* MPI derived datatype for setting a file view */    
   starts[0] = row_mpi*ROWS;
   starts[1] = col_mpi*COLS;
   MPI_Type_create_subarray(2, sizes, subsizes, starts,
                            MPI_ORDER_C, MPI_INT,
                            &subarray); 
   MPI_Type_commit(&subarray);
   /* open the file */    
   printf("opening file\n");
   MPI_File_open(MPI_COMM_WORLD, "arrdata.dat", 
                 MPI_MODE_WRONLY | MPI_MODE_CREATE,
                 MPI_INFO_NULL, &fh);
   printf("opened file\n");
   /* set the initial file view */    
   MPI_File_set_view(fh, 0, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL);
   /* proc 0 writes first header */    
   if (rank_mpi == 0) {
      MPI_File_write(fh, (void*)hdr, strlen(hdr), MPI_CHAR, &stat_mpi);
      MPI_File_get_position(fh, &offset);
      MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   }
   /* everybody has to know where proc 0 stopped writing */    
   MPI_Bcast((void*)&end_of_hdr, 1, MPI_INT, 0, MPI_COMM_WORLD);
   /* re-set file view for writing first array */    
   MPI_File_set_view(fh, end_of_hdr, MPI_INT,
                     subarray, "native",
                     MPI_INFO_NULL);
   /* and write the array */    
   MPI_File_write(fh, (void*)vals, ROWS*COLS, MPI_INT,
                  &stat_mpi);

   /* now go through the whole thing again to test */
   MPI_File_get_position(fh, &offset);
   MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   MPI_File_set_view(fh, end_of_hdr, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL);
   if (rank_mpi == 0) {
      MPI_File_write(fh, (void*)hdr, strlen(hdr), MPI_CHAR, &stat_mpi);
      MPI_File_get_position(fh, &offset);
      MPI_File_get_byte_offset(fh, offset, &end_of_hdr); 
   }

   MPI_Bcast((void*)&end_of_hdr, 1, MPI_INT, 0, MPI_COMM_WORLD);

   MPI_File_set_view(fh, end_of_hdr, MPI_INT,
                     subarray, "native",
                     MPI_INFO_NULL);
   MPI_File_write(fh, (void*)vals, ROWS*COLS, MPI_INT,
                  &stat_mpi);
   MPI_File_close(&fh);

   MPI_Finalize();

   return 0;

}
bob.sacamento
  • Do you always know the header sizes upfront, and are they all identical (the sizes)? If so, you can create a view for process #0 that include the header and data, and another for the other processes that only includes the data. Then you'd have only one call to make to `MPI_File_set_view()` and 2 calls to `MPI_File_write()` per iteration for rank #0 and one for the other ranks. – Gilles Jun 15 '16 at 15:54
  • @Gilles I know the header sizes upfront. They are not all identical, unfortunately. I can't change that. Not my design. – bob.sacamento Jun 15 '16 at 16:27
  • 1
    There is a bug in your code: `end_of_hdr` is of type `MPI_Offset` and you are broadcasting it using `MPI_INT`. `MPI_Offset` is typically 64-bit while `MPI_INT` corresponds to `int`, which on LP64 Unix systems (*BSD, Linux, Solaris) is only 32-bit. On a big endian system, a very wrong offset will get broadcasted. Use `MPI_OFFSET` (if supported by the MPI implementation) or the corresponding C type (check your `mpi.h`) – Hristo Iliev Jun 16 '16 at 09:14
  • @HristoIliev Thank you. I was wondering about that. Couldn't find anything about "MPI_OFFSET" in the standard so I tried MPI_INT. It worked, so I figured that must be the ticket. But I guess not. Will look into it. Thanks again. – bob.sacamento Jun 16 '16 at 14:12
  • Strange, `MPI_AINT` and `MPI_OFFSET` are described together with all the other predefined MPI datatypes in section 3.2.2 (Message Data) of the standard at least since version 2.2. – Hristo Iliev Jun 16 '16 at 14:32
  • It didn't occur to me that MPI_OFFSET would be its own datatype. My "go to" for this kind of thing is table 3.2, which doesn't mention it. I was assuming that table was exhaustive. – bob.sacamento Jun 16 '16 at 15:11
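Following up on the comment thread above, a minimal sketch of the corrected broadcast, assuming the MPI implementation provides the predefined `MPI_OFFSET` datatype (described in the standard since version 2.2); the variable names are the ones from the test code:

   /* end_of_hdr is an MPI_Offset, so broadcast it with the matching
      predefined datatype; MPI_INT would truncate it on LP64 systems. */
   MPI_Bcast(&end_of_hdr, 1, MPI_OFFSET, 0, MPI_COMM_WORLD);

If `MPI_OFFSET` is not available, broadcasting `sizeof(MPI_Offset)` bytes as `MPI_BYTE` is a workable fallback on homogeneous systems.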

1 Answer


Your approach is fine and if you need something right now to put bits in a file, go ahead and call yourself done.

Here are some suggestions for more efficiency:

  • You can consult the status object for how many bytes were written, instead of getting the position and translating into bytes (the sketch after this list shows one way).

  • If you have the memory to hold all the data before you write, you could describe your I/O with an MPI datatype (admittedly, one that might end up being a pain to create). Then all processes would issue a single collective call.

  • You should use collective I/O instead of independent I/O. A "quality library" should be able to give you equal if not better performance (and if not, you could raise the issue with your MPI implementation).

  • If the processes have different amounts of data to write, MPI_EXSCAN is a good way to collect who has what data. Then you can call MPI_FILE_WRITE_AT_ALL at the correct offset in the file; see the sketch after this list.
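A minimal sketch of the first and last suggestions combined (a hypothetical helper with made-up names, not code from this answer), assuming the file still has its default byte view, so explicit offsets are plain byte positions:

   /* Hypothetical helper: each rank writes 'mycount' ints from 'mybuf',
      packed rank after rank starting at byte offset 'data_start'.
      Returns the number of ints this rank actually wrote. */
   static int write_ragged_block(MPI_File fh, MPI_Offset data_start,
                                 const int *mybuf, int mycount, MPI_Comm comm)
   {
      int rank, written;
      MPI_Offset mybytes = (MPI_Offset)mycount * sizeof(int);
      MPI_Offset myoff = 0;
      MPI_Status st;

      MPI_Comm_rank(comm, &rank);

      /* Exclusive prefix sum: rank r receives the total byte count of ranks 0..r-1. */
      MPI_Exscan(&mybytes, &myoff, 1, MPI_OFFSET, MPI_SUM, comm);
      if (rank == 0) myoff = 0;   /* MPI_Exscan leaves rank 0's result undefined */

      /* One collective write; each rank supplies its own explicit byte offset. */
      MPI_File_write_at_all(fh, data_start + myoff, mybuf, mycount, MPI_INT, &st);

      /* How much was actually written comes straight from the status,
         with no MPI_File_get_position()/MPI_File_get_byte_offset() round trip. */
      MPI_Get_count(&st, MPI_INT, &written);
      return written;
   }

Each rank only needs to know its own count; `MPI_Exscan` turns the counts into starting offsets, and the single collective call lets the MPI library aggregate the ragged pieces into a small number of large requests.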

Rob Latham
  • Thanks! Re: your third bullet, I want only one proc to write the headers. I wasn't able to figure out how to do that w/o isolating those calls to MPI_File_write. Is there a way for everyone to call MPI_File_write, but only one proc do the actual writing? Thanks again! – bob.sacamento Jun 16 '16 at 19:35
  • 1
    absolutely. It's fair to have N processes call a collective operation, but only one of them has data. any process can pass in a 0 for the 'count' parameter. However, I was suggesting a single collective where rank 0 writes the header along with the array, while everyone else writes their portion of the array. – Rob Latham Jun 16 '16 at 19:37
  • understood. Thanks once more. – bob.sacamento Jun 16 '16 at 19:42
  • You should use "MPI_File_write_all" when you write the array - I have seen orders of magnitude performance improvements for large data sets compared to MPI_File_write. Telling the IO library you are doing a collective write (_all) enables it to aggregate data from different processes before writing, e.g. leading to a small number of large IO transactions as opposed to a large number of small transactions, which can significantly improve performance. – David Henty Jun 17 '16 at 09:42