I'm trying to work on a problem in Fortran using MPI, and I am getting an intermittent bug with clearly incorrect values appearing. The bug seems to occur when I use MPI_REDUCE.

I've pared my code down to as short a segment as possible with the error still happening. This segment is pretty useless aside from its strange behaviour, and try as I might, I can't isolate it any further.

I don't understand the behaviour of this code. For example:

  • If I remove the subroutine at the top (which is never invoked), the bug appears to go away.
  • If I give the arrays a fixed size with real,dimension(10,10) in their declarations instead of allocating them, the bug appears to go away, although I don't think my current allocate statements are incorrect.
  • Even if I change some of the variable names, the bug appears to go away.

None of this tells me why the bug exists or how to fix it in my longer code project. It seems that either I have failed to allocate memory correctly somewhere, or I am using MPI_REDUCE incorrectly, but I can't find the problem.

subroutine foo()
  use netcdf
  integer :: iret,ncid
  iret = nf90_open('test.nc',nf90_nowrite,ncid)  !open the mask file
  iret = nf90_close(ncid)  !close the mask file
  return
end subroutine foo

program test
use mpi
integer   :: ierr,pid
real :: diffsum,total_sum
real,allocatable,dimension(:,:) :: c,h,h_old

call MPI_INIT(ierr)       

total_sum = 0.0

call MPI_COMM_RANK(MPI_COMM_WORLD,pid,ierr)

if(pid.ne.0) then
  allocate(h   (10,10))
  allocate(h_old(10,10))
  h(:,:) = 1.0
  h_old(:,:) = 1.0
  allocate(c(10,10))
  c = h_old - h
  diffsum = 0.0
endif

call MPI_REDUCE(diffsum,total_sum,1,MPI_REAL,mpi_sum,0,MPI_COMM_WORLD,ierr)  !to get overall threshold

if(pid.eq.0)then
  print*,'sum',total_sum
endif

call MPI_FINALIZE(ierr)
end program test

The value printed should always be 0, but sometimes other values appear. Here is the output from 10 runs:

 sum  -3.66304099E+25
 sum   0.00000000    
 sum   0.00000000    
 sum  -3.01998057E+29
 sum   0.00000000    
 sum   0.00000000    
 sum   0.00000000    
 sum   0.00000000    
 sum   0.00000000    
 sum   0.00000000    

Thank you for any ideas!

K_Lee
  • Well you haven't initialized diffsum on rank zero, so the results of your program are undefined. See if correcting that fixes it. – Ian Bush Jun 20 '19 at 20:52
  • @IanBush now I feel silly! I actually did notice that earlier, changed it, and thought it didn't work - I must have forgotten to recompile. Thank you so much! – K_Lee Jun 20 '19 at 21:47
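
The fix suggested in the comment can be sketched as follows. MPI_REDUCE combines the send buffer from every rank in the communicator, including the root, so diffsum needs a defined value on rank 0 as well. This is a minimal sketch, assuming rank 0 is meant to contribute nothing to the sum; the program name test_fixed is hypothetical, the array work from the question is omitted for brevity, and implicit none is an addition that is not in the original code.

```fortran
program test_fixed
  use mpi
  implicit none
  integer :: ierr, pid
  real    :: diffsum, total_sum

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)

  ! Give both buffers a defined value on EVERY rank, rank 0 included,
  ! before any rank-dependent work. (Assumption: rank 0 contributes 0.)
  diffsum   = 0.0
  total_sum = 0.0

  if (pid /= 0) then
    ! Non-root ranks would compute their local contribution here.
    ! In the question's reduced example this is simply zero.
    diffsum = 0.0
  end if

  ! All ranks, including the root, supply diffsum to the reduction.
  call MPI_REDUCE(diffsum, total_sum, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (pid == 0) then
    print *, 'sum', total_sum
  end if

  call MPI_FINALIZE(ierr)
end program test_fixed
```

With diffsum left unset on rank 0, the root adds whatever happens to be in that memory location to the result. That would be consistent with the bug appearing and disappearing when unrelated code or variable names change.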