
I am trying to execute an MPI program across a heterogeneous cluster of two machines, one running Ubuntu 12.04 (64-bit) and the other CentOS 6.4 (64-bit).

I compile a simple MPI program on CentOS, scp it over to Ubuntu, and test that it works with one or many MPI processes local to each machine. I can confirm it works on each machine separately.

When I try to execute the program across both machines, I get a "message truncated" error on the MPI_Wait. I believe this means one machine is sending more or fewer bytes than the receiving machine is prepared to accept.

The program (snippet):

    if(rank==0){
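        // taskobject_id is declared earlier in the program (assumed here to be int taskobject_id[2])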

        taskobject_id[0] = 4;
        taskobject_id[1] = 5;
        MPI_Request* req = new MPI_Request();

        MPI_Isend(&taskobject_id, 2, MPI_INT, 1, 0, MPI_COMM_WORLD, req);

        MPI_Status stat;
        MPI_Wait(req, &stat);

    }
    else if(rank==1){

        taskobject_id[0] = 1;
        taskobject_id[1] = 1;
        MPI_Request* req = new MPI_Request();

        MPI_Irecv(&taskobject_id, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, req);

        MPI_Status stat;
        MPI_Wait(req, &stat);
    }

My question is: is each machine evaluating a different number of bytes to send/receive in the communication? Is MPI_INT machine-dependent?

If so, does anyone have any pointers here as to what I should do to solve this?
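
In case it is relevant, here is a minimal standalone check (not my actual program; the file name is just for illustration) that I can run on both machines to compare what each MPI library reports for MPI_INT:

    // mpi_int_size.cpp -- hypothetical standalone check, built with mpicxx
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char host[MPI_MAX_PROCESSOR_NAME];
        int host_len;
        MPI_Get_processor_name(host, &host_len);

        // Size in bytes that this rank's MPI library assigns to one MPI_INT.
        int mpi_int_size;
        MPI_Type_size(MPI_INT, &mpi_int_size);

        printf("rank %d on %s: MPI_Type_size(MPI_INT)=%d, sizeof(int)=%zu\n",
               rank, host, mpi_int_size, sizeof(int));

        MPI_Finalize();
        return 0;
    }

Running this under the same mpirun command as the failing program should show whether the two libraries disagree about the size of MPI_INT.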

EDIT: The problem persists when count=8 and type is MPI_BYTE. I'm at a loss.

EDIT2: Interestingly, the problem doesn't occur when the ranks are swapped. From testing, the behaviour is identical to what happens when the receive operation specifies a larger count than the send operation provides. It is therefore clear that the CentOS machine treats one count of MPI_INT as fewer bytes than the Ubuntu machine does.

When the receiver count > sender count, the wait operations complete and the code continues, but MPI_Barrier then causes the program to hang, even though both ranks are confirmed to 'enter' the barrier.
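
For completeness, this is the kind of check I can add on the rank-1 side to see how many elements actually arrived (a sketch only; taskobject_id is the same int[2] buffer as above, and printf assumes <cstdio> is included):

    else if(rank==1){

        MPI_Request* req = new MPI_Request();
        MPI_Irecv(&taskobject_id, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, req);

        MPI_Status stat;
        MPI_Wait(req, &stat);

        // Ask how many MPI_INT elements the matched message actually carried.
        int recv_count = 0;
        MPI_Get_count(&stat, MPI_INT, &recv_count);
        printf("rank 1: message contained %d MPI_INT elements\n", recv_count);
    }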

Thanks!

  • Let's get the obvious out of the way: are you using the exact same version of OpenMPI on all machines? Compiled yourself or from their respective repos? – Adam Sep 26 '13 at 08:02
  • There is a possibility that there is a difference in the exact OpenMPI configuration between the two, as the machines were not configured by me or even by the same person, although ompi_info shows both versions are 1.5.4. Another machine is currently being set up with exactly the same OS and MPI package as the CentOS machine, so I will update this question with my findings. – ricky116 Sep 26 '13 at 14:02
  • Even if the version as shown by `ompi_info` is the same, there are configuration-time options that could influence the content of each message. Compare for example the output of `ompi_info | grep Hetero` on both machines. – Hristo Iliev Sep 26 '13 at 14:56
  • I can confirm that the same code did not produce the problem when the cluster was made homogeneous (two identical CentOS machines, same MPI build). Thanks to Hristo Iliev, I realised that in my previous setup one of the machines did not have heterogeneous support in the MPI build. – ricky116 Oct 01 '13 at 13:37
  • It is probably incompatible OpenMPI versions. Heterogeneous in this context means different platforms (different processors, OS or OS versions), not different MPI implementations; you should have the same OS version. In your case, I assume that you are using OpenMPI versions that are incompatible with each other. – ipapadop May 14 '14 at 19:34

0 Answers