
Please correct me if I am misunderstanding how MPI_Send and MPI_Recv work, since I have just started learning MPI.

My current understanding is that the MPI standard guarantees that two messages sent one after another from the same sender to the same receiver will always be received in the order they were sent. This suggests to me that some kind of queuing must be happening at the receiver, at the sender, or as part of some distributed state.
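To make sure I am describing the right guarantee, here is a minimal sketch (separate from my benchmark below, and only my understanding of the rule): two sends from rank 0 to rank 1 on the same communicator with the same tag must be matched by the receives in the order they were posted.

#include "mpi.h"
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int first = 1, second = 2;
        MPI_Send(&first,  1, MPI_INT, 1, 0, MPI_COMM_WORLD); // posted first
        MPI_Send(&second, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); // posted second
    } else if (rank == 1) {
        int a = 0, b = 0;
        // Same source, same tag, same communicator: the non-overtaking rule
        // says a must match the first send and b the second.
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Received %d then %d\n", a, b);
    }

    MPI_Finalize();
    return 0;
}

(Run with mpirun -np 2; it should always print "Received 1 then 2".)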

I am trying to understand the nature of this queue, so I wrote a simple ping-pong program in which each odd-ranked node exchanges messages with the even-ranked node directly below it (rank - 1).

The idea is that if there is a global queue shared across all the nodes in the cluster, then running with a higher number of nodes should substantially increase the latency observed at each node. On the other hand, if the queue is at each receiver, then the latency increase should be relatively small. However, I get very mixed results, so I am not sure how to interpret them.

Can someone provide an interpretation of the following results, with respect to where the queue resides?

$ mpirun -np 2 simple
Rank = 0, Message Length = 0, end - start = 0.000119
$ mpirun -np 2 simple
Rank = 0, Message Length = 0, end - start = 0.000117
$ mpirun -np 4 simple
Rank = 2, Message Length = 0, end - start = 0.000119
Rank = 0, Message Length = 0, end - start = 0.000253
$ mpirun -np 4 simple
Rank = 2, Message Length = 0, end - start = 0.000129
Rank = 0, Message Length = 0, end - start = 0.000303
$ mpirun -np 6 simple
Rank = 4, Message Length = 0, end - start = 0.000144
Rank = 2, Message Length = 0, end - start = 0.000122
Rank = 0, Message Length = 0, end - start = 0.000415
$ mpirun -np 8 simple
Rank = 4, Message Length = 0, end - start = 0.000119
Rank = 0, Message Length = 0, end - start = 0.000336
Rank = 2, Message Length = 0, end - start = 0.000323
Rank = 6, Message Length = 0, end - start = 0.000287
$ mpirun -np 10 simple
Rank = 2, Message Length = 0, end - start = 0.000127
Rank = 8, Message Length = 0, end - start = 0.000158
Rank = 0, Message Length = 0, end - start = 0.000281
Rank = 4, Message Length = 0, end - start = 0.000286
Rank = 6, Message Length = 0, end - start = 0.000278

This is the code that implements the ping-pong.

#include "mpi.h" // MPI_I*
#include <stdlib.h>


#define MESSAGE_COUNT 100

int main(int argc, char* argv[]){

    if (MPI_Init( &argc, &argv) != MPI_SUCCESS) {
        std::cerr << "MPI Failed to Initialize" << std::endl;
        return 1;
    }
    int rank = 0, size = 0;

    // Get this process's rank and the size of the communicator
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t message_len = 0;

    char* buf = new char[message_len];

    MPI_Status status;
    // Pingpong between even and odd machines
    if (rank & 1) { // Odd ranked machine will just pong
        for (int i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Recv(buf, (int) message_len, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Send(buf, (int) message_len, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD);
        }
    }
    else { // Even ranked machine will ping and time.
        double start = MPI_Wtime();

        for (int i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Send(buf, (int) message_len, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int) message_len, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }

        double end = MPI_Wtime();
        printf("Rank = %d, Message Length = %zu, end - start = %f\n", rank, message_len, end - start);
    }
    delete[] buf;

    MPI_Finalize();
    return 0;
}
merlin2011
  • This kind of detail is up to the MPI implementation. You may glean some information by perusing your MPI library's source code, if available. – suszterpatt Apr 10 '14 at 20:39
  • There are no global queues per se in MPI though different network hardware usually utilise some sort of queueing. In MPI only rank-to-rank ordering is guaranteed and only for messages within the same communicator having the same tag. Note that `MPI_Init()` could finish at very different times in different ranks therefore you should insert a call to `MPI_Barrier` before you start measuring the ping-pong latency. – Hristo Iliev Apr 11 '14 at 06:48
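As a concrete illustration of the barrier suggestion in the comment above, here is a sketch (a fragment of the original main(), not a complete program) with MPI_Barrier inserted before the timed region; the barrier must be called by every rank, so it goes before the if (rank & 1) branch:

    MPI_Barrier(MPI_COMM_WORLD); // all ranks synchronize here, so ranks that
                                 // finish MPI_Init() late cannot skew the timing

    if (rank & 1) { // Odd ranked machine will just pong, unchanged
        for (int i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Recv(buf, (int) message_len, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Send(buf, (int) message_len, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD);
        }
    }
    else { // Even ranked machine pings and times, starting the clock only after the barrier
        double start = MPI_Wtime();
        for (int i = 0; i < MESSAGE_COUNT; i++) {
            MPI_Send(buf, (int) message_len, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int) message_len, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        }
        double end = MPI_Wtime();
        printf("Rank = %d, Message Length = %zu, end - start = %f\n", rank, message_len, end - start);
    }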

0 Answers