Writing to multiple shared files with MPI-IO

Question

I'm running a simulation with thousands of MPI processes and need to write output data to a small set of files. For example, even though I might have 10,000 processes I only want to write out 10 files, with 1,000 writing to each one (at some appropriate offset). AFAIK the correct way to do this is to create a new communicator for the groups of processes that will be writing to the same files, open a shared file for that communicator with MPI_File_open(), and then write to it with MPI_File_write_at_all(). Is that correct? The following code is a toy example that I wrote up:

#include <mpi.h>
#include <math.h>
#include <stdio.h>

const int MAX_NUM_FILES = 4;

int main(){
    MPI_Init(NULL, NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    int numProcsPerFile = ceil(((double) numProcs) / MAX_NUM_FILES);
    int targetFile = rank / numProcsPerFile;

    MPI_Comm fileComm;
    MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm);

    int targetFileRank;
    MPI_Comm_rank(fileComm, &targetFileRank);

    char filename[20]; // Sufficient for testing purposes
    snprintf(filename, 20, "out_%d.dat", targetFile);
    printf(
        "Proc %d: writing to file %s with rank %d\n", rank, filename,
        targetFileRank);

    MPI_File outFile;
    MPI_File_open(
        fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    char bufToWrite[4];
    snprintf(bufToWrite, 4, "%3d", rank);

    MPI_File_write_at_all(
        outFile, targetFileRank * 3,
        bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Finalize();
}

I can compile with mpicc file.c -lm and run, say, 20 processes with mpirun -np 20 a.out, and I get the expected output (four files with five entries each), but I'm unsure whether this is the technically correct/most optimal way of doing it. Is there anything I should do differently?

MPI I/O is worth using in the context of collective I/O, which is what you are doing with `MPI_File_write_at_all`. I think what you meant here is: how scalable is this approach, and how well will your real program perform with 1000..., right? — Arash, Mar 02 '17 at 23:44
Not quite. I'm asking whether my approach of creating one communicator per output file and then writing to those communicators with `MPI_File_write_at_all()` is the technically correct thing to do. For instance, do I actually need to create a separate communicator per output file, or can I get away with just using `MPI_COMM_WORLD` or `MPI_COMM_SELF`, or something like that? I'm also more generally looking for an assessment of my code, since maybe I'm doing something wrong/in a way that's not idiomatic for MPI. — sevko, Mar 02 '17 at 23:48
Oh, I see. Yes! A separate communicator is needed for every unique file. Look [here](http://mpi-forum.org/docs/mpi-2.2/mpi22-report-book.pdf), page 391, "MPI_FILE_OPEN is a collective routine: all processes must provide the same value for amode, and all processes must provide filenames that reference the same file. (Values for info may vary.) comm must be an intracommunicator; it is erroneous to pass an intercommunicator to MPI_FILE_OPEN. Errors in MPI_FILE_OPEN are raised using the default file error handler (see Section 13.7, page 447)" — Arash, Mar 03 '17 at 00:30
In general, the number of communicators depends on the logic of your application and which processes are supposed to open which files. The whole idea is about grouping the processes in such a way that MPI runtime can handle the synchronizations needed for collective I/O. — Arash, Mar 03 '17 at 00:48
Excellent, thanks! If you post your comments as an answer (with the link and all) I'll accept it. — sevko, Mar 04 '17 at 02:57

Al Barrentine · Answer 1 · 2017-10-24T21:37:19.713

1

MPI_File_write_at_all should be the most efficient way to do this. Collective IO functions are typically fastest for large non-contiguous parallel writes to a shared file and the _all variant combines the seek and the write into one call.

edited Oct 24 '17 at 21:37

answered Feb 24 '17 at 22:04


score 1 · Accepted Answer · answered Mar 04 '17 at 04:48

Your approach is correct. To clarify, we need to revisit the standard and its definitions. The MPI_File_open API, from MPI: A Message-Passing Interface Standard Version 2.2 (page 391):

int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh)

Description:

MPI_FILE_OPEN opens the file identified by the file name filename on all processes in the comm communicator group. MPI_FILE_OPEN is a collective routine: all processes must provide the same value for amode, and all processes must provide filenames that reference the same file. (Values for info may vary.) comm must be an intracommunicator; it is erroneous to pass an intercommunicator to MPI_FILE_OPEN.

intracommunicator vs intercommunicator (page 134):

For the purposes of this chapter, it is sufficient to know that there are two types of communicators: intra-communicators and inter-communicators. An intracommunicator can be thought of as an identifier for a single group of processes linked with a context. An intercommunicator identifies two distinct groups of processes linked with a context.

The point of passing an intracommunicator to MPI_File_open() is to specify the set of processes that will perform operations on the file. The MPI runtime needs this information so that it can enforce the appropriate synchronization when collective I/O operations occur. It is the programmer's responsibility to understand the logic of the application and create or choose the correct intracommunicators.

MPI_Comm_split() is a powerful API that splits a communicating group into disjoint subgroups, which can then be used for different purposes, including MPI I/O.
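
The grouping that MPI_Comm_split() produces in the question's code can be reasoned about without running MPI. The sketch below is only an illustration: `FileGroup` and `file_group_for_rank` are hypothetical names, not part of MPI. It assumes the question's scheme, where the color is the target file index and the key is the world rank, so ranks inside each new communicator keep their relative order.

```c
/* Hypothetical helper, not an MPI call: predicts what the question's
 * MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm) yields. */
typedef struct {
    int color;      /* file index, passed as "color" to MPI_Comm_split */
    int local_rank; /* expected rank inside the per-file communicator */
} FileGroup;

static FileGroup file_group_for_rank(int rank, int num_procs, int max_files)
{
    /* Integer ceiling division; for positive inputs this matches
     * ceil((double) num_procs / max_files) from the question. */
    int procs_per_file = (num_procs + max_files - 1) / max_files;

    FileGroup g;
    g.color = rank / procs_per_file;      /* contiguous blocks of ranks share a file */
    g.local_rank = rank % procs_per_file; /* holds because key == world rank */
    return g;
}
```

For the 20-process, 4-file run from the question, `file_group_for_rank(7, 20, 4)` gives color 1 and local rank 2, so world rank 7 would write at byte offset 2 * 3 = 6 of out_1.dat.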

score 1 · Answer 3 · answered Mar 09 '17 at 18:16

I think it's probably a typo above, but it's the "_all" that signifies a collective operation; the "_at" is the part that combines the seek and the write into one call.

The main point I wanted to make, however, was that the reason the collective operations are faster is that they enable the I/O system to aggregate data from many processes. You may issue 1000 writes from 1000 processes, but with the collective form this might be aggregated into a single large write to the file (rather than 1000 small writes). This is of course a best-case scenario, but the improvements can be dramatic - for access to a shared file I have seen collective I/O go 1000 times faster than non-collective, admittedly for more complicated IO patterns than this.
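
To make the aggregation idea concrete, here is a toy model in plain C. It is an assumption-laden sketch, not how any particular MPI implementation works: `Extent` and `coalesce_extents` are hypothetical names, and real collective I/O also involves exchanging data between processes. It only shows how many separate write requests remain once adjacent byte ranges are merged.

```c
#include <stddef.h>

/* Toy model only: one contiguous byte range requested by one process. */
typedef struct {
    long long offset; /* starting byte in the file */
    long long length; /* number of bytes */
} Extent;

/* Merge extents that touch or overlap, in place. The input must already be
 * sorted by offset. Returns the number of extents left, i.e. how many
 * write requests would remain in this simplified model. */
static size_t coalesce_extents(Extent *ext, size_t n)
{
    if (n == 0) {
        return 0;
    }
    size_t out = 0; /* index of the extent currently being grown */
    for (size_t i = 1; i < n; i++) {
        long long end = ext[out].offset + ext[out].length;
        if (ext[i].offset <= end) {
            /* Adjacent or overlapping: extend the current extent if needed. */
            long long new_end = ext[i].offset + ext[i].length;
            if (new_end > end) {
                ext[out].length = new_end - ext[out].offset;
            }
        } else {
            /* Gap in the file: start a new extent. */
            ext[++out] = ext[i];
        }
    }
    return out + 1;
}
```

With the question's layout, where 1000 processes each write 3 bytes at offset rank * 3, this model collapses the 1000 requests into a single extent of 3000 bytes starting at offset 0.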
