Parallel (OpenMP) Fortran code stalls after long time without giving error

Question

While running a Fortran code containing an OpenMP parallel region, I have encountered the following problem: the code runs fine for some time (until counter is roughly 1,000,000,000 in the code below) and then stalls without crashing or reporting any error. A code snippet that reproduces the problem:

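program crasher
    implicit none
    integer*8 :: limit      ! trip count of the inner (parallel) loop
    integer*8 :: i,temp
    integer*8 :: counter    ! number of parallel regions entered so far
    limit=1062
    counter=0
    ! Endless outer loop: every pass opens and closes one parallel region.
    do
        counter=counter+1
        !$OMP PARALLEL &
        !$OMP DEFAULT(none) &
        !$OMP PRIVATE(i,temp) &
        !$OMP SHARED(limit)
        !$OMP DO
        do i=1,limit
            temp=0          ! dummy work; temp is private to each thread
        enddo
        !$OMP END DO
        !$OMP END PARALLEL

        ! Progress report every 100000 passes.
        if (mod(counter,100000).eq.0) then
            write(6,'(A,I0)') "Number of runs: ",counter
        endif
    enddo
end program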

When I run strace -p PID for each of the processes (16 cores) spawned by this code, one of them yields:

...
futex(0x7f4617a91a00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f4617a91a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 461, {1489578049, 907892000}, ffffffff^C) = -1 ETIMEDOUT (Connection timed out)
...

over and over again, even after the other processes have ceased to do anything. Running the same code on a different machine, the above strace output does not appear and the code runs fine. Running the code in serial, it runs fine on both machines.

I have compiled with ifort (v 15.0.2) as well as gfortran (v 4.8.5), and both compilers give the same result: on one machine the code runs fine, on the other it stalls.

I have found some information suggesting that this might be a problem with the Linux kernel. The machine that stalls runs "Linux 2.6.32-431.23.3.el6.x86_64"; the one that works runs "Linux 3.10.0-327.18.2.el7.x86_64". Does anyone have an idea how to fix or work around this?

Have you tried to look at the stacks of the processes during the stall using `gdb`? — Zulan, Mar 15 '17 at 12:40
I would start with HUGE(counter), and maybe have the MODULO use an integer*8 for the 100000. What happens when you remove the !$OMP PARALLEL and !$OMP END PARALLEL? Or maybe try sumtemp = sumtemp + temp and add REDUCTION(+:sumtemp); it is also possible that temp should be SHARED. — Holmz, Mar 16 '17 at 12:57
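
Regarding the gdb suggestion in the comments above: a minimal shell sketch of how the thread stacks of the stalled program could be collected. It assumes Linux with /proc, gdb on the PATH, and permission to attach to the process (same user or root). The function names list_threads and dump_stacks are hypothetical helpers made up for illustration, and 12345 stands in for the real PID.

```shell
#!/bin/sh
# Sketch only: helpers for inspecting a stalled multi-threaded process.
# Assumptions: Linux /proc filesystem, gdb installed, permission to attach.

# Print the thread IDs (TIDs) that belong to the process with the given PID.
# Each TID can be passed to "strace -p" individually.
list_threads() {
    ls "/proc/$1/task"
}

# Attach gdb to the given PID, print a backtrace of every thread, then detach.
# -batch makes gdb exit once the -ex command has run.
dump_stacks() {
    gdb -p "$1" -batch -ex "thread apply all bt"
}

# Example usage (12345 is a placeholder PID):
#   list_threads 12345
#   dump_stacks 12345 > stacks.txt
```

Comparing the backtraces of the thread that keeps waking up on the futex with those of the idle threads may show which runtime call each of them is blocked in.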
