While running a Fortran program that contains an OpenMP parallel region, I have hit a problem: the code runs fine for some time (until counter is roughly 1,000,000,000 in the code below) and then stalls, without crashing or reporting any error. A code snippet that reproduces the problem:
program crasher
  implicit none
  integer*8 :: limit
  integer*8 :: i,temp
  integer*8 :: counter
  limit=1062
  counter=0
  ! Endless outer loop: each pass opens and closes one parallel region.
  do
    counter=counter+1
    !$OMP PARALLEL &
    !$OMP DEFAULT(none) &
    !$OMP PRIVATE(i,temp) &
    !$OMP SHARED(limit)
    !$OMP DO
    do i=1,limit
      temp=0   ! trivial work; the stall does not depend on the loop body
    enddo
    !$OMP END DO
    !$OMP END PARALLEL
    ! Report progress every 100000 parallel regions.
    if (mod(counter,100000).eq.0) then
      write(6,'(A,I0)') "Number of runs: ",counter
    endif
  enddo
end program
When I run strace -p PID with the PIDs of the processes (16 cores) spawned by this code, one of them yields:
...
futex(0x7f4617a91a00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f4617a91a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 461, {1489578049, 907892000}, ffffffff^C) = -1 ETIMEDOUT (Connection timed out)
...
over and over again, even after the other processes have ceased to do anything. When I run the same code on a different machine, the above strace output does not appear and the code runs fine. Run serially, the code works on both machines.
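For anyone who wants to repeat the strace check, this is roughly how the per-thread IDs can be collected on Linux. It is only a sketch: list_tids is a helper name I made up, it assumes /proc is mounted, and the PID 12345 in the comments is a placeholder.

```shell
#!/bin/sh
# list_tids (hypothetical helper): print the thread IDs of a process.
# On Linux every thread of a process has an entry in /proc/<pid>/task,
# and OpenMP worker threads appear there alongside the main thread.
list_tids() {
    ls "/proc/$1/task"
}

# Demo on the current shell, which has a single thread whose ID equals its PID.
list_tids "$$"

# Against the stalled program (12345 is a placeholder PID), one could then run:
#   for tid in $(list_tids 12345); do echo "== $tid"; timeout 5 strace -p "$tid"; done
```

Attaching usually needs to be done as the same user or as root, and timeout is used here only so that the loop moves on to the next thread.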
I have compiled with both ifort (v15.0.2) and gfortran (v4.8.5), and the behavior is the same with either compiler: one machine works, the other one stalls.
I have found some information suggesting that this might be a problem with the Linux kernel. The machine that produces the error has "Linux 2.6.32-431.23.3.el6.x86_64", the other has "Linux 3.10.0-327.18.2.el7.x86_64". Does anyone have an idea how to fix or work around this?