
I am trying to improve one of the codes I use for numerical simulations. One of the computation steps requires building several large arrays, whose computation is complex and costly. What is done now is that each array is computed by a specific MPI process and then broadcast to the other ones. The subroutine doing this is called very often during a run of the program, so it needs to be as fast as possible.

However, I suspect that the five successive MPI_BCAST calls are deleterious to the program's performance... I did some tests using a non-blocking broadcast (MPI_IBCAST) and saw a performance improvement. Unfortunately, I cannot use it, as it does not seem to be available in some MPI implementations (at least not in the versions installed on the clusters I'm using...).
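For reference, the non-blocking variant I tested replaced the five blocking calls with something roughly like this (same array and kproc names as in the simplified example below; it needs MPI_IBCAST from MPI-3, which is exactly what I cannot rely on):

   integer, dimension(5) :: requests

   ! Start all five broadcasts, then wait for them at once
   call MPI_IBCAST(a, n, MPI_DOUBLE_PRECISION, kproc(1), MPI_COMM_WORLD, requests(1), ierr)
   call MPI_IBCAST(b, n, MPI_DOUBLE_PRECISION, kproc(2), MPI_COMM_WORLD, requests(2), ierr)
   call MPI_IBCAST(c, n, MPI_DOUBLE_PRECISION, kproc(3), MPI_COMM_WORLD, requests(3), ierr)
   call MPI_IBCAST(d, n, MPI_DOUBLE_PRECISION, kproc(4), MPI_COMM_WORLD, requests(4), ierr)
   call MPI_IBCAST(e, n, MPI_DOUBLE_PRECISION, kproc(5), MPI_COMM_WORLD, requests(5), ierr)
   call MPI_WAITALL(5, requests, MPI_STATUSES_IGNORE, ierr)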

Do you have any ideas on how to improve this situation? Below is a simplified version of the code I'm trying to optimize...

program test
   use mpi
   implicit none
   integer,       parameter                      :: dp   = kind(0.d0)
   real(dp), dimension(:), allocatable           :: a, b, c, d, e
   integer, dimension(5)                         :: kproc  

   integer                                       :: myid, numprocs, ierr
   integer                                       :: i,n


   call MPI_INIT(ierr) 
   call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr) 
   call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr) 

   n = 5000 ! Can be much greater

   allocate(a(n), b(n), c(n), d(n), e(n))
   do i=1,5
     kproc(i)=mod(i-1,numprocs)
   enddo  

   if(myid==kproc(1)) then 
       a = 1.0_dp ! Complex computation for a
   endif
   if(myid==kproc(2)) then 
       b = 2.0_dp ! Complex computation for b
   endif
   if(myid==kproc(3)) then 
       c = 3.0_dp ! Complex computation for c
   endif
   if(myid==kproc(4)) then 
       d = 4.0_dp ! Complex computation for d
   endif
   if(myid==kproc(5)) then 
       e = 5.0_dp ! Complex computation for e
   endif


   call MPI_BCAST(a, n, MPI_DOUBLE_PRECISION, kproc(1), MPI_COMM_WORLD, ierr)
   call MPI_BCAST(b, n, MPI_DOUBLE_PRECISION, kproc(2), MPI_COMM_WORLD, ierr)
   call MPI_BCAST(c, n, MPI_DOUBLE_PRECISION, kproc(3), MPI_COMM_WORLD, ierr)
   call MPI_BCAST(d, n, MPI_DOUBLE_PRECISION, kproc(4), MPI_COMM_WORLD, ierr)
   call MPI_BCAST(e, n, MPI_DOUBLE_PRECISION, kproc(5), MPI_COMM_WORLD, ierr)
   d = d + e
   call MPI_FINALIZE(ierr)
end program test

In this example, you can see that the computation of the five arrays a, b, c, d and e is split between the MPI processes. Notice also that d and e are in fact two parts of the same array: in the end, what matters is only the value of d = d+e.
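To make that last point concrete, the d/e pair could in principle be collapsed into a single reduction instead of two broadcasts. This is only a sketch of the idea, not what the code currently does, and it assumes every rank needs the summed result:

   ! Each rank contributes its own part of d+e (zero elsewhere), then one
   ! MPI_ALLREDUCE replaces the last two MPI_BCAST calls
   if (myid /= kproc(4)) d = 0.0_dp
   if (myid == kproc(5)) d = d + e
   call MPI_ALLREDUCE(MPI_IN_PLACE, d, n, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)

That still leaves the other three broadcasts, though.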

  • Are all processes part of the computation, or just five out of many more? Maybe `MPI_Allgather` would be an option. It is a little difficult to understand what your example is supposed to represent, as `a,b,c` are unused. – Zulan Feb 17 '16 at 22:55
  • This is not the performance issue you claim it is. – Jeff Hammond Feb 18 '16 at 03:45
  • Such problems are usually solved by having each rank work on a different section of the arrays it needs, not by having each array processed by a separate rank followed by a global data sync as in your case. – Hristo Iliev Feb 18 '16 at 09:08
  • But the MPI tutorials are full of BCAST of a big array to small parts in other ranks. I have never needed it. – Vladimir F Героям слава Feb 20 '16 at 13:33
