I'm trying to work on a problem in Fortran using MPI, and I am getting an intermittent bug with clearly incorrect values appearing. The bug seems to occur when I use MPI_REDUCE.
I've pared my code down to the shortest segment that still shows the error. The segment does nothing useful apart from reproducing the strange behaviour, and try as I might, I can't isolate it any further.
I don't understand why this code behaves the way it does. For example, if I remove the subroutine at the top (which is never invoked), the bug appears to go away. If I give the arrays a fixed size by declaring them with real,dimension(10,10), the bug appears to go away, although I don't think my current allocate statements are incorrect. Even if I change some of the variable names, the bug appears to go away. None of these observations tells me why the bug exists or how to fix it in my longer code project. It seems that either I have failed to allocate memory correctly somewhere, or I am using MPI_REDUCE incorrectly, but I can't find the problem.
subroutine foo()
  use netcdf
  integer :: iret,ncid
  iret = nf90_open('test.nc',nf90_nowrite,ncid) !open the mask file
  iret = nf90_close(ncid) !close the mask file
  return
end subroutine foo

program test
  use mpi
  integer :: ierr,pid
  real :: diffsum,total_sum
  real,allocatable,dimension(:,:) :: c,h,h_old
  call MPI_INIT(ierr)
  total_sum = 0.0
  call MPI_COMM_RANK(MPI_COMM_WORLD,pid,ierr)
  if(pid.ne.0) then
    allocate(h (10,10))
    allocate(h_old(10,10))
    h(:,:) = 1.0
    h_old(:,:) = 1.0
    allocate(c(10,10))
    c = h_old - h
    diffsum = 0.0
  endif
  call MPI_REDUCE(diffsum,total_sum,1,MPI_REAL,mpi_sum,0,MPI_COMM_WORLD,ierr) !to get overall threshold
  if(pid.eq.0)then
    print*,'sum',total_sum
  endif
  call MPI_FINALIZE(ierr)
end program test
The value printed should always be 0, but sometimes other values appear. Here is an example of the outputs from 10 runs:
sum -3.66304099E+25
sum 0.00000000
sum 0.00000000
sum -3.01998057E+29
sum 0.00000000
sum 0.00000000
sum 0.00000000
sum 0.00000000
sum 0.00000000
sum 0.00000000
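One way to narrow this down might be to print what each rank is about to contribute just before the reduce. The following is only a diagnostic sketch of the program above, not a fix: the program name test_diag and the extra print line are additions for illustration, implicit none is added (every variable is already declared), and the unused subroutine foo is left out for brevity, so it would need to stay in the file when comparing against the original symptom.

```fortran
program test_diag
  use mpi
  implicit none
  integer :: ierr, pid
  real :: diffsum, total_sum
  real, allocatable, dimension(:,:) :: c, h, h_old

  call MPI_INIT(ierr)
  total_sum = 0.0
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)

  ! Same work as the original: only non-root ranks enter this branch.
  if (pid .ne. 0) then
    allocate(h(10,10))
    allocate(h_old(10,10))
    h(:,:) = 1.0
    h_old(:,:) = 1.0
    allocate(c(10,10))
    c = h_old - h
    diffsum = 0.0
  endif

  ! Diagnostic only (not in the original): show the value each rank
  ! passes to MPI_REDUCE. As in the original, diffsum is assigned
  ! only inside the branch above.
  print *, 'rank', pid, 'diffsum before reduce', diffsum

  ! MPI_SUM combines the send buffer of every rank in the communicator.
  call MPI_REDUCE(diffsum, total_sum, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (pid .eq. 0) then
    print *, 'sum', total_sum
  endif

  call MPI_FINALIZE(ierr)
end program test_diag
```

Assuming an MPI installation that provides the usual mpif90 and mpirun wrappers, this could be built and run on a few processes, and the per-rank lines compared against the printed sum on the runs where the stray value shows up.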
Thank you for any ideas!