I am trying to execute an MPI program across a heterogeneous cluster of two machines, one running Ubuntu 12.04 (64-bit) and the other CentOS 6.4 (64-bit).
I compile a simple MPI program on CentOS, scp it over to Ubuntu, and test that it runs with one or many MPI processes local to each machine. I can confirm it works on each machine separately.
When I try to execute the program across both machines, I get a "message truncated" error on the MPI_Wait. I believe this is telling me that one machine is sending a different number of bytes than the receiving machine is prepared to accept.
The program (snippet):
if(rank==0){
    taskobject_id[0] = 4;
    taskobject_id[1] = 5;
    MPI_Request* req = new MPI_Request();
    MPI_Isend(&taskobject_id, 2, MPI_INT, 1, 0, MPI_COMM_WORLD, req);
    MPI_Status stat;
    MPI_Wait(req, &stat);
}
else if(rank==1){
    taskobject_id[0] = 1;
    taskobject_id[1] = 1;
    MPI_Request* req = new MPI_Request();
    MPI_Irecv(&taskobject_id, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, req);
    MPI_Status stat;
    MPI_Wait(req, &stat);
}
My question is: is each machine evaluating a different number of bytes to send/receive in the communication? Is MPI_INT machine-dependent?
If so, does anyone have any pointers here as to what I should do to solve this?
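In case it is useful, here is a small diagnostic I could run on each machine to compare what the two MPI installations report for MPI_INT (the file name and the prints are just mine for illustration, not part of the real program):

// check_mpi_int.cpp -- hypothetical diagnostic: print what each rank's MPI
// library thinks MPI_INT is, plus the MPI version it reports.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int mpi_int_size = 0;
    MPI_Type_size(MPI_INT, &mpi_int_size);   // size in bytes of MPI_INT for this build

    int version = 0, subversion = 0;
    MPI_Get_version(&version, &subversion);  // MPI standard version the library reports

    char name[MPI_MAX_PROCESSOR_NAME];
    int name_len = 0;
    MPI_Get_processor_name(name, &name_len);

    printf("rank %d on %s: sizeof(int)=%zu, MPI_Type_size(MPI_INT)=%d, MPI %d.%d\n",
           rank, name, sizeof(int), mpi_int_size, version, subversion);

    MPI_Finalize();
    return 0;
}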
EDIT: The problem persists when count=8 and type is MPI_BYTE. I'm at a loss.
EDIT2: Interestingly, the problem doesn't occur when the ranks are swapped. From testing, the behaviour is identical to what happens when the receive operation specifies a higher count than the send operation sends. It therefore looks like the CentOS machine thinks one count of MPI_INT is smaller than what the Ubuntu machine thinks.
When the receiver count > sender count, the wait operations complete and the code continues, but MPI_Barrier then causes the program to hang, even though both ranks are confirmed to 'enter' the barrier.
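For reference, this is roughly what my stripped-down mismatched-count test looks like, including the prints I use to confirm both ranks reach the barrier (the buffer and the counts here are illustrative, not my real values):

// Minimal sketch of the mismatched-count test described above.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf[4] = {0, 0, 0, 0};
    MPI_Request req;
    MPI_Status stat;

    if (rank == 0) {
        MPI_Isend(buf, 2, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);   // send 2 ints
        MPI_Wait(&req, &stat);
    } else if (rank == 1) {
        MPI_Irecv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);   // receive up to 4 ints
        MPI_Wait(&req, &stat);
    }

    printf("rank %d: entering MPI_Barrier\n", rank);
    fflush(stdout);
    MPI_Barrier(MPI_COMM_WORLD);   // this is where the cross-machine run hangs for me
    printf("rank %d: passed MPI_Barrier\n", rank);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}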
Thanks!