
I have an application where the root rank is sending messages to all ranks in the following way:

tag = 22
if (myrankid == 0) then
   do i = 1, nproc
      if (i == 1) then
         ! root keeps its own portion locally
         do j = 1, nvert
            xyz((j-1)*3+1) = data((j-1)*3+1,1)
            xyz((j-1)*3+2) = data((j-1)*3+2,1)
            xyz((j-1)*3+3) = data((j-1)*3+3,1)
         enddo
      else
         call mpi_send(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
      endif
   enddo
else
   call mpi_recv(data, glb_nvert(i)*3, mpi_real, 0, tag, comm, stat, ierr)
endif

My problem is that only when running on more than 3000 ranks does this send/recv pair hang at a certain MPI rank (in my specific app it is rank 2009).

Now, I do check that the sizes and arrays are consistent, and the only thing I found interesting was the comm. The comm is a communicator which I have duplicated from another MPI communicator.
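The duplication looks roughly like the sketch below (parent_comm is just a placeholder name for the original communicator, not the name in my actual code):

integer :: parent_comm, comm, ierr
! comm is a duplicate of an existing communicator
! (parent_comm stands in for that communicator here)
call mpi_comm_dup(parent_comm, comm, ierr)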

When I print comm (with print*, comm), all ranks print the same integer, except for the root.

E.g.

The root prints:

-1006632941

while the remaining 2999 ranks print:

-1006632951

Is that really what is causing the problem?

I have tried using Intel MPI and Cray MPI.

ATK
  • A [mcve] (minimal reproducible example) is likely to be needed. – Vladimir F Героям слава Jul 17 '21 at 11:00
  • The value of the communicator variable only has meaning locally. There is no reason whatsoever for it to be the same on all procs, and commonly is not. – Ian Bush Jul 17 '21 at 11:05
  • What you are doing is effectively an `MPI_BCAST`. If you modify the code to use that, does it work? (A sketch of the broadcast variant follows this comment thread.) – Ian Bush Jul 17 '21 at 16:17
  • @IanBush, in reality `data` is loaded from a file using `hdf5`, specific to each rank in the loop. I tried changing `comm` to `MPI_COMM_WORLD`, so even my initial suspicion that `comm` was misbehaving is not correct, since the problem persisted even when using `MPI_COMM_WORLD`. I can try to remove the `hdf5` calls in my real application and just pass dummy arrays, to rule out whether the problem is somehow related to the `hdf5` library. But thanks for confirming that the communicator variable is locally set! – ATK Jul 17 '21 at 22:22
  • Without see *exactly* what you are doing in a minimal, complete, reproducible example it's impossible to say more. – Ian Bush Jul 18 '21 at 07:04
  • @IanBush I fully understand. I tried to write that MPI send/recv as a standalone program, but that worked, so it won't be useful. I also removed every other function call within the loop and just sent a dummy variable from root to each MPI rank (like a `bcast` but done as send/recv; a standalone sketch of this test follows the comments). That also hung around the `2038`th rank. If, on the other hand, I changed the dummy variable from `size=7250` to just `size=1`, it worked. Not sure if MPI in the backend runs out of memory or does something else problematic. The error seems to be linked with relatively high core counts and large messages – ATK Jul 19 '21 at 17:07
  • and it works fine with `gfortran` and `openmpi`, but won't work with Intel MPI and Cray MPI – ATK Jul 19 '21 at 17:08
  • Sounds as though you are misusing blocking communications. What happens if you change it to isend/irecv and add appropriate waits? – Ian Bush Jul 19 '21 at 17:18
  • I did, in fact, although the isend/irecv version is very primitive, since I had to put a wait immediately after the respective send/recv (see the non-blocking sketch after the comments). The reason is that in my real example the exchanged data is modified after each `send`, so you want to make sure it has arrived at the respective rank before modifying it – ATK Jul 20 '21 at 09:05
  • How many nodes are you running your MPI application on? – ArunJose Jul 28 '21 at 08:34
  • Please let us know the exact command line you are using and also the hardware details – Arpita - Intel Jul 29 '21 at 06:38
  • Using 30 nodes with 128 cores each. I am using the 64-core AMD EPYC chip, two of them per node. I have used both srun and an explicit mpirun ./app in the Slurm script – ATK Jul 30 '21 at 06:48
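For reference, a minimal sketch of the `MPI_BCAST` variant suggested in the comments, assuming every rank can receive the same buffer with a common element count (nvals is a placeholder name, not from the original code):

! every rank, including the root, makes the same broadcast call;
! the root's buffer is the source, all other ranks' buffers are overwritten
call mpi_bcast(data, nvals, mpi_real, 0, comm, ierr)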
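A self-contained version of the dummy-buffer test described in the comments might look like the following sketch; the buffer size of 7250 is taken from the comment above, while everything else (names, the use of MPI_COMM_WORLD) is assumed:

program send_recv_test
   use mpi
   implicit none
   integer, parameter :: n = 7250          ! size quoted in the comments
   integer :: myrankid, nproc, i, tag, ierr
   integer :: stat(mpi_status_size)
   real    :: buf(n)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, myrankid, ierr)
   call mpi_comm_size(mpi_comm_world, nproc, ierr)

   tag = 22
   buf = real(myrankid)

   if (myrankid == 0) then
      ! root sends the same dummy buffer to every other rank in turn
      do i = 2, nproc
         call mpi_send(buf, n, mpi_real, i-1, tag, mpi_comm_world, ierr)
      enddo
   else
      call mpi_recv(buf, n, mpi_real, 0, tag, mpi_comm_world, stat, ierr)
   endif

   call mpi_finalize(ierr)
end program send_recv_test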
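And a sketch of the non-blocking variant described in the comments, with the wait placed immediately after each call (which, as noted there, makes it effectively blocking again); the receive count glb_nvert(myrankid+1)*3 is an assumption about what the local count would be:

integer :: req, stat(mpi_status_size)

if (myrankid == 0) then
   do i = 2, nproc
      call mpi_isend(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, req, ierr)
      ! wait before the next iteration, because data is modified between sends
      call mpi_wait(req, stat, ierr)
   enddo
else
   call mpi_irecv(data, glb_nvert(myrankid+1)*3, mpi_real, 0, tag, comm, req, ierr)
   call mpi_wait(req, stat, ierr)
endif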

0 Answers