
I am working on a C++ application where I use the MPI C bindings to send and receive data over a network. I understand that sending

const int VECTOR_SIZE = 1e6;
std::vector<int> vector(VECTOR_SIZE, 0);

via

// Version A
MPI_Send(const_cast<int *>(vector.data()), vector.size(), MPI_INT, 1, 0, MPI_COMM_WORLD);

is much more efficient than

// Version B
for (const auto &element : vector)
    MPI_Send(const_cast<int *>(&element), 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

due to the per-call latency of MPI_Send. However, if I want to send a data structure that is not contiguous in memory (a std::list<int>, for instance), I cannot use Version A but have to resort to Version B, or first copy the list's contents into a contiguous container (such as a std::vector<int>) and then use Version A. Since I want to avoid this extra copy, I wonder whether MPI offers any options or other functions that make Version B (or at least a similar, loop-like construct) efficient, i.e. that avoid incurring the latency on every call to MPI_Send?

Marcel

1 Answer


Stepping through your std::list and sending its elements one by one would indeed cause significant communication overhead.

The MPI specification is designed to be language independent, which is why it uses language-agnostic MPI datatypes. The consequence is that it can only send from contiguous buffers (a feature most languages offer) and not from more complex data structures such as linked lists.

To avoid the communication overhead of sending one by one, there are two alternatives:

  • copy all the list elements into a std::vector and send the vector. However, this creates a memory overhead AND makes the sending completely sequential (during which time some MPI nodes could be idle).


  • or iterate through your list, building smaller vectors/buffers, and send these smaller chunks (possibly dispatching them to several destination nodes?). This approach has the benefit of making better use of I/O latency and parallelism through a pipelining effect. You will, however, have to experiment a little to find the optimal size of the intermediate chunks (see the sketch after this list).
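
To make the two alternatives concrete, here is a minimal, untested sketch. The function names, the destination rank, the tags, and the chunk size of 4096 are arbitrary choices, and the matching receives on the destination rank are not shown:

#include <mpi.h>
#include <list>
#include <vector>

// Alternative 1: copy the whole list into one contiguous buffer and send it
// in a single call (memory overhead, completely sequential).
void send_list_as_vector(const std::list<int> &data, int dest, MPI_Comm comm)
{
    std::vector<int> buffer(data.begin(), data.end());
    MPI_Send(buffer.data(), static_cast<int>(buffer.size()),
             MPI_INT, dest, /*tag=*/0, comm);
}

// Alternative 2: send the list in fixed-size chunks, reusing one small buffer,
// so the receiver can process chunk k while the sender builds chunk k+1.
void send_list_in_chunks(const std::list<int> &data, int dest, MPI_Comm comm,
                         std::size_t chunk_size = 4096)
{
    std::vector<int> chunk;
    chunk.reserve(chunk_size);

    int tag = 0;
    for (int element : data) {
        chunk.push_back(element);
        if (chunk.size() == chunk_size) {
            // One MPI_Send per chunk instead of one per element.
            MPI_Send(chunk.data(), static_cast<int>(chunk.size()),
                     MPI_INT, dest, tag++, comm);
            chunk.clear();
        }
    }
    if (!chunk.empty())  // send the remaining partial chunk
        MPI_Send(chunk.data(), static_cast<int>(chunk.size()),
                 MPI_INT, dest, tag, comm);
}

On the receiving side you would post one MPI_Recv per chunk; if the total number of elements is not known in advance, MPI_Probe together with MPI_Get_count can be used to size the receive buffer for each incoming chunk.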

Christophe