When writing blocks to a file, with the blocks unevenly distributed across the processes, one can use MPI_File_write_at with the correct offset. Since this function is not a collective operation, this works well. Example:

#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{

    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global  = 7; // a prime number of blocks gives an unbalanced distribution
    int local   = (global/size) + (global%size>rank?1:0); // blocks owned by this rank
    int strsize = 5; // bytes per block

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    for (int i=0; i<local; ++i)
    {
        // Blocks are distributed round-robin: block idx lives on rank idx % size
        size_t idx = i * size + rank;
        std::string buffer = std::string(strsize, 'a' + idx);

        // Explicit-offset write: no collective call, so uneven loop counts are fine
        MPI_Offset offset = buffer.size() * idx;
        MPI_File_write_at(fh, offset, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}

However, for more complex writes, particularly when writing multi-dimensional data such as raw images, one may want to create a view on the file with MPI_Type_create_subarray. But when using this method with a plain MPI_File_write (which is supposed to be non-collective), I run into deadlocks. Example:

#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{

    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global  = 7; // a prime number of blocks gives an unbalanced distribution
    int local   = (global/size) + (global%size>rank?1:0); // blocks owned by this rank
    int strsize = 5; // bytes per block

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    for (int i=0; i<local; ++i)
    {
        // Blocks are distributed round-robin: block idx lives on rank idx % size
        int idx = i * size + rank;

        std::string buffer = std::string(strsize, 'a' + idx);

        // Describe this block as one column of a (strsize x global) array
        int dim = 2;
        int gsizes[2] = { strsize, global };
        int lsizes[2] = { strsize,      1 };
        int offset[2] = {       0,    idx };

        MPI_Datatype filetype;
        MPI_Type_create_subarray(dim, gsizes, lsizes, offset, MPI_ORDER_C, MPI_CHAR, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
        MPI_File_write(fh, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_Type_free(&filetype);
    }

    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}

How can I prevent such code from deadlocking? Keep in mind that my real code will actually use the multi-dimensional capabilities of MPI_Type_create_subarray and cannot simply fall back to MPI_File_write_at.

Also, it is difficult for me to know the maximum number of blocks over all processes, so I'd like to avoid doing an allreduce and then looping up to that maximum with empty writes whenever localnb <= id < maxnb.


1 Answer


You don't use MPI_REDUCE when you have a variable number of blocks per process. You use MPI_SCAN or MPI_EXSCAN; see: MPI IO Writing a file when offset is not known.
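As a minimal sketch of that idea, assuming the global/local/strsize setup from the question: each rank contributes the number of bytes it will write, and MPI_Exscan returns the sum over all lower ranks, which is exactly that rank's starting byte offset if each rank's data is stored contiguously (note that this is a different layout from the interleaved one in the question).

#include <cstdio>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global  = 7;
    int local   = (global/size) + (global%size>rank?1:0);
    int strsize = 5;

    // Exclusive prefix sum of the per-rank byte counts gives each rank
    // the offset at which its contiguous chunk of the file starts.
    long long mybytes  = (long long)local * strsize;
    long long myoffset = 0;
    MPI_Exscan(&mybytes, &myoffset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) myoffset = 0;   // MPI_Exscan leaves rank 0's result undefined

    std::printf("rank %d writes %lld bytes starting at offset %lld\n",
                rank, mybytes, myoffset);

    MPI_Finalize();
    return 0;
}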

MPI_File_set_view is collective, so if 'local' differs from one process to another, you end up calling a collective routine from fewer than all processes in the communicator, which is exactly where the second example hangs. If you really need per-block views, open the file with MPI_COMM_SELF.
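For illustration, a minimal sketch of that workaround applied to the second example from the question: only the communicator passed to MPI_File_open changes (plus freeing the per-iteration datatype), since a view on a file opened on MPI_COMM_SELF is "collective" over a single process only.

#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global  = 7;
    int local   = (global/size) + (global%size>rank?1:0);
    int strsize = 5;

    // Every rank opens the file on its own communicator, so the
    // per-iteration MPI_File_set_view below involves no other rank.
    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    for (int i=0; i<local; ++i)
    {
        int idx = i * size + rank;
        std::string buffer = std::string(strsize, 'a' + idx);

        int gsizes[2] = { strsize, global };
        int lsizes[2] = { strsize,      1 };
        int offset[2] = {       0,    idx };

        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, gsizes, lsizes, offset, MPI_ORDER_C, MPI_CHAR, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
        MPI_File_write(fh, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_Type_free(&filetype);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The price is that every write is independent, so the MPI-IO layer cannot aggregate or otherwise optimize across processes.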

The MPI_SCAN approach means each process can set the file view as needed, and then you can call the collective MPI_File_write_at_all (even if some processes have zero work -- they still need to participate) and take advantage of whatever clever optimizations your MPI-IO implementation provides; a sketch follows below.
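A minimal sketch of that combination, again assuming the setup from the question and packing each rank's blocks contiguously so that a single collective call per rank is enough (a subarray view could be set collectively in the same spirit; here the default view with explicit byte offsets keeps the example short):

#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global  = 7;
    int local   = (global/size) + (global%size>rank?1:0);
    int strsize = 5;

    // Pack this rank's blocks into one contiguous buffer.
    std::string buffer;
    for (int i=0; i<local; ++i)
        buffer += std::string(strsize, 'a' + (i * size + rank));

    // MPI_Exscan gives the total byte count of all lower ranks,
    // i.e. the starting offset of this rank's chunk.
    long long mybytes  = (long long)buffer.size();
    long long myoffset = 0;
    MPI_Exscan(&mybytes, &myoffset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) myoffset = 0;   // MPI_Exscan leaves rank 0's result undefined

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective write: called exactly once on every rank,
    // including any rank whose buffer happens to be empty.
    MPI_File_write_at_all(fh, (MPI_Offset)myoffset, buffer.c_str(),
                          (int)buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}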
