
I am working on a C++ application where I use the MPI C bindings to send and receive data over a network. I understand that sending

const int VECTOR_SIZE = 1e6;
std::vector<int> vector(VECTOR_SIZE, 0);

via

// Version A
MPI_Send(const_cast<int *>(vector.data()), vector.size(), MPI_INT, 1, 0, MPI_COMM_WORLD);

is much more efficient than

// Version B
for (const auto &element : vector)
    MPI_Send(const_cast<int *>(&element), 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

due to the per-call latency of MPI_Send. However, if I want to send a data structure that is not contiguous in memory (a std::list<int>, for instance), I cannot use Version A but have to resort to Version B, or first copy the list's contents into a contiguous container (such as a std::vector<int>) and then use Version A. Since I want to avoid this extra copy, I wonder whether MPI offers any options or other functions that make Version B (or at least a similar, loop-like construct) efficient, i.e. that avoid incurring the latency on every call to MPI_Send?

Marcel

1 Answer


Stepping through your std::list and sending its elements one by one would indeed cause significant communication overhead.

The MPI specification is designed to be language independent, which is why it uses language-agnostic MPI datatypes. The consequence is that it can only send from contiguous buffers (a feature most languages offer) and not from more complex data structures such as linked lists.

To avoid the communication overhead of sending one by one, there are two alternatives:

  • copy all the list elements into a std::vector and send the vector. However, this creates a memory overhead AND makes the sending completely sequential (during which time some MPI nodes could be idle).


  • or iterate through your list, building smaller vectors/buffers, and send these smaller chunks (possibly dispatching them to several destination nodes?). This approach has the benefit of making better use of I/O latency and parallelism through a pipelining effect. You will, however, have to experiment a little to find the optimal size of the intermediate chunks (see the sketch after this list).
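
To make the two alternatives concrete, here is a minimal, untested sketch. The function names, the destination rank, the tags, and the chunk size of 4096 are arbitrary choices, and the matching receives on the destination rank are not shown:

#include <mpi.h>
#include <list>
#include <vector>

// Alternative 1: copy the whole list into one contiguous buffer and send it
// in a single call (memory overhead, completely sequential).
void send_list_as_vector(const std::list<int> &data, int dest, MPI_Comm comm)
{
    std::vector<int> buffer(data.begin(), data.end());
    MPI_Send(buffer.data(), static_cast<int>(buffer.size()),
             MPI_INT, dest, /*tag=*/0, comm);
}

// Alternative 2: send the list in fixed-size chunks, reusing one small buffer,
// so the receiver can process chunk k while the sender builds chunk k+1.
void send_list_in_chunks(const std::list<int> &data, int dest, MPI_Comm comm,
                         std::size_t chunk_size = 4096)
{
    std::vector<int> chunk;
    chunk.reserve(chunk_size);

    int tag = 0;
    for (int element : data) {
        chunk.push_back(element);
        if (chunk.size() == chunk_size) {
            // One MPI_Send per chunk instead of one per element.
            MPI_Send(chunk.data(), static_cast<int>(chunk.size()),
                     MPI_INT, dest, tag++, comm);
            chunk.clear();
        }
    }
    if (!chunk.empty())  // send the remaining partial chunk
        MPI_Send(chunk.data(), static_cast<int>(chunk.size()),
                 MPI_INT, dest, tag, comm);
}

On the receiving side you would post one MPI_Recv per chunk; if the total number of elements is not known in advance, MPI_Probe together with MPI_Get_count can be used to size the receive buffer for each incoming chunk.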

Christophe