
I have an application where the root rank is sending messages to all ranks in the following way:

tag = 22
if (myrankid == 0) then
   do i = 1, nproc
      if (i == 1) then
         ! root keeps its own portion locally
         do j = 1, nvert
            xyz((j-1)*3+1) = data((j-1)*3+1,1)
            xyz((j-1)*3+2) = data((j-1)*3+2,1)
            xyz((j-1)*3+3) = data((j-1)*3+3,1)
         enddo
      else
         call mpi_send(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
      endif
   enddo
else
   call mpi_recv(data, glb_nvert(i)*3, mpi_real, 0, tag, comm, stat, ierr)
endif

My problem is that only when running on more than 3000 ranks does this send/recv pair hang at a certain MPI rank (in my specific app it is rank 2009).

Now, I do check that the sizes and arrays are consistent, and the only thing I found interesting was the comm. The comm is a communicator which I have duplicated from another MPI communicator.
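The duplication looks roughly like the sketch below (parent_comm is just a placeholder name for the original communicator, not the name in my actual code):

integer :: parent_comm, comm, ierr
! comm is a duplicate of an existing communicator
! (parent_comm stands in for that communicator here)
call mpi_comm_dup(parent_comm, comm, ierr)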

When I print comm (with print*, comm), all ranks print the same integer, except for the root.

E.g.

The root prints:

-1006632941

while the remaining 2999 ranks print:

-1006632951

Is that really what is causing the problem?

I have tried using Intel MPI and Cray MPI.

ATK
  • A [mcve] (minimal reproducible example) is likely to be needed. – Vladimir F Героям слава Jul 17 '21 at 11:00
  • The value of the communicator variable only has meaning locally. There is no reason whatsoever for it to be the same on all procs, and commonly is not. – Ian Bush Jul 17 '21 at 11:05
  • What you are doing is effectively an `MPI_BCAST`. If you modify the code to use that, does it work? (A sketch of the broadcast variant follows this comment thread.) – Ian Bush Jul 17 '21 at 16:17
  • @IanBush, in reality `data` is loaded from a file using `hdf5`, specific to each rank in the loop. I tried changing `comm` to `MPI_COMM_WORLD`, so even my initial suspicion that `comm` was misbehaving is not correct, since the problem persisted even when using `MPI_COMM_WORLD`. I can try to remove the `hdf5` calls in my real application and just pass dummy arrays, to rule out whether the problem is somehow related to the `hdf5` library. But thanks for confirming that the communicator variable is locally set! – ATK Jul 17 '21 at 22:22
  • Without see *exactly* what you are doing in a minimal, complete, reproducible example it's impossible to say more. – Ian Bush Jul 18 '21 at 07:04
  • @IanBush I fully understand. I tried to write that MPI send/recv as a standalone program, but that worked, so it won't be useful. I also removed every other function call within the loop and just sent a dummy variable from root to each MPI rank (like a `bcast` but done as send/recv; a standalone sketch of this test follows the comments). That also hung around the `2038`th rank. If, on the other hand, I changed the dummy variable from `size=7250` to just `size=1`, it worked. Not sure if MPI in the backend runs out of memory or does something else problematic. The error seems to be linked with relatively high core counts and large messages – ATK Jul 19 '21 at 17:07
  • and it works fine with `gfortran` and `openmpi`, but won't work with Intel MPI and Cray MPI – ATK Jul 19 '21 at 17:08
  • Sounds as though you are misusing blocking communications. What happens if you change it to isend/irecv and add appropriate waits? – Ian Bush Jul 19 '21 at 17:18
  • I did, in fact, although the isend/irecv version is very primitive, since I had to put a wait immediately after the respective send/recv (see the non-blocking sketch after the comments). The reason is that in my real example the exchanged data is modified after each `send`, so you want to make sure it has arrived at the respective rank before modifying it – ATK Jul 20 '21 at 09:05
  • How many nodes are you running your MPI application on? – ArunJose Jul 28 '21 at 08:34
  • Please let us know the exact command line you are using and also the hardware details – Arpita - Intel Jul 29 '21 at 06:38
  • Using 30 nodes with 128 cores each. I am using the 64-core AMD EPYC chip, two of them per node. I have used both srun and an explicit mpirun ./app in the Slurm script – ATK Jul 30 '21 at 06:48
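For reference, a minimal sketch of the `MPI_BCAST` variant suggested in the comments, assuming every rank can receive the same buffer with a common element count (nvals is a placeholder name, not from the original code):

! every rank, including the root, makes the same broadcast call;
! the root's buffer is the source, all other ranks' buffers are overwritten
call mpi_bcast(data, nvals, mpi_real, 0, comm, ierr)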
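A self-contained version of the dummy-buffer test described in the comments might look like the following sketch; the buffer size of 7250 is taken from the comment above, while everything else (names, the use of MPI_COMM_WORLD) is assumed:

program send_recv_test
   use mpi
   implicit none
   integer, parameter :: n = 7250          ! size quoted in the comments
   integer :: myrankid, nproc, i, tag, ierr
   integer :: stat(mpi_status_size)
   real    :: buf(n)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, myrankid, ierr)
   call mpi_comm_size(mpi_comm_world, nproc, ierr)

   tag = 22
   buf = real(myrankid)

   if (myrankid == 0) then
      ! root sends the same dummy buffer to every other rank in turn
      do i = 2, nproc
         call mpi_send(buf, n, mpi_real, i-1, tag, mpi_comm_world, ierr)
      enddo
   else
      call mpi_recv(buf, n, mpi_real, 0, tag, mpi_comm_world, stat, ierr)
   endif

   call mpi_finalize(ierr)
end program send_recv_test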
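And a sketch of the non-blocking variant described in the comments, with the wait placed immediately after each call (which, as noted there, makes it effectively blocking again); the receive count glb_nvert(myrankid+1)*3 is an assumption about what the local count would be:

integer :: req, stat(mpi_status_size)

if (myrankid == 0) then
   do i = 2, nproc
      call mpi_isend(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, req, ierr)
      ! wait before the next iteration, because data is modified between sends
      call mpi_wait(req, stat, ierr)
   enddo
else
   call mpi_irecv(data, glb_nvert(myrankid+1)*3, mpi_real, 0, tag, comm, req, ierr)
   call mpi_wait(req, stat, ierr)
endif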

0 Answers