I have an MPI program that is fairly straightforward: essentially "initialize, two sends from the master to the slaves, two receives on the slaves, do a bunch of system calls for copying/moving files and running an external code, tidy up and call MPI_FINALIZE".
This seems straightforward, but I'm not getting MPI_FINALIZE to work correctly. Below is a condensed version of the program, with all of the system copy/move/external-code calls rolled up into "do codish stuff" placeholder comments.
program mpi_finalize_break
  !<variable declarations>
  call MPI_INIT(ierr)
  icomm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(icomm,nproc,ierr)
  call MPI_COMM_RANK(icomm,rank,ierr)

  !<do codish stuff for a while>

  if (rank == 0) then
     !<set up some stuff then call MPI_SEND in a loop over number of slaves>
     call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
     call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
  else
     call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
     call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
     !<do codish stuff for a while>
  endif

  print*, "got here4", rank
  call MPI_BARRIER(icomm,ierr)
  print*, "got here5", rank, ierr
  call MPI_FINALIZE(ierr)
  print*, "got here6"

end program mpi_finalize_break
Now, the problem I am seeing occurs around the "got here4", "got here5" and "got here6" statements. I get the appropriate number of print statements, with the corresponding ranks, for both "got here4" and "got here5". That is, the master and all the slaves (rank 0 and all other ranks) reach the barrier, get through it, and reach MPI_FINALIZE, all reporting ierr = 0. However, after MPI_FINALIZE I get all kinds of weird behavior at "got here6": sometimes I get one fewer "got here6" than I expect, sometimes six fewer, and the program hangs forever, never exiting, leaving an orphaned process on one (or more) of the compute nodes.
I am running this on a machine with an InfiniBand backbone, with the NFS server shared over InfiniBand (NFS/RDMA). I'm trying to figure out how the MPI_BARRIER call can work fine while MPI_FINALIZE ends up with random orphaned processes (not the same node, nor the same number of orphans, each time). My guess is that it is related to the various system calls to cp, mv, ./run_some_code, cp, mv, but I wasn't sure whether the speed of InfiniBand could also be involved, since all of this happens fairly quickly. My intuition could be wrong, of course. Does anybody have thoughts? I can post the whole code if it helps, but I believe this condensed version captures the problem. I'm running OpenMPI 1.8.4 compiled against ifort 15.0.2, with Mellanox adapters running firmware 2.9.1000.
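To give a slightly more concrete picture of the "do codish stuff" parts: each slave essentially shells out a handful of times, something like the sketch below (the file names here are placeholders, not the real ones; only ./run_some_code is named as in the description above):

!<hypothetical sketch of the system-call pattern, not the real code>
call system("cp input_template.dat workdir/input.dat")
call system("./run_some_code")
call system("mv workdir/output.dat results/")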
Thanks for the help.
Update:
Per the request, I put an MPI_Abort call in and got the following:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
pburn 0000000000438CB1 Unknown Unknown Unknown
pburn 0000000000437407 Unknown Unknown Unknown
libmpi_usempif08. 00002B5BCB5C5712 Unknown Unknown Unknown
libmpi_usempif08. 00002B5BCB5C5566 Unknown Unknown Unknown
libmpi_usempif08. 00002B5BCB5B3DCC Unknown Unknown Unknown
libmpi_usempif08. 00002B5BCB594F63 Unknown Unknown Unknown
libpthread.so.0 000000345C00F710 Unknown Unknown Unknown
libc.so.6 000000345B8DB2ED Unknown Unknown Unknown
libc.so.6 000000345B872AEF Unknown Unknown Unknown
libc.so.6 000000345B866F26 Unknown Unknown Unknown
libopen-pal.so.6 00002B5BCC313EB2 Unknown Unknown Unknown
libopen-rte.so.7 00002B5BCC0416FE Unknown Unknown Unknown
libmpi.so.1 00002B5BCBD539DF Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B5BCBADCF5A Unknown Unknown Unknown
pburn 0000000000416889 MAIN__ 415 parallel_burn.f90
pburn 00000000004043DE Unknown Unknown Unknown
libc.so.6 000000345B81ED5D Unknown Unknown Unknown
pburn 00000000004042E9 Unknown Unknown Unknown
But the code otherwise runs correctly (all the correct output files and results are produced).
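For completeness, the abort I added is just the plain MPI call, roughly like this (the exact spot in the real code may differ, and the error code is arbitrary):

call MPI_ABORT(icomm, 1, ierr)   ! debugging abort; error code 1 is arbitrary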