I am trying to improve one of the codes I use for numerical simulations. One of the computation steps requires building several large arrays, whose computation is complex and costly. What is done now is that each array is computed by a specific MPI process and then broadcast to the other ones. The subroutine doing this is called very often during a run of the program, so it needs to be as fast as possible.
However, I suspect that the five successive MPI_BCAST calls hurt the program's performance... I did some tests using a non-blocking broadcast (MPI_IBCAST) and saw a performance improvement (see the sketch at the end of this post). Unfortunately, I cannot use it, as it does not seem to be available in some MPI implementations (at least not in the versions installed on the clusters I'm using...).
Do you have any ideas on how to improve this situation? Below is a simplified version of the code I'm trying to optimize:
program test
  use mpi
  implicit none
  integer, parameter :: dp = kind(0.d0)
  real(dp), dimension(:), allocatable :: a, b, c, d, e
  integer, dimension(5) :: kproc
  integer :: myid, numprocs, ierr
  integer :: i, n

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

  n = 5000 ! Can be much greater
  allocate(a(n), b(n), c(n), d(n), e(n))
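  ! Assign each of the five arrays to a rank, round-robin over the available processes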
  do i = 1, 5
    kproc(i) = mod(i-1, numprocs)
  enddo
  if(myid == kproc(1)) then
    a = 1.0_dp ! Complex computation for a
  endif
  if(myid == kproc(2)) then
    b = 2.0_dp ! Complex computation for b
  endif
  if(myid == kproc(3)) then
    c = 3.0_dp ! Complex computation for c
  endif
  if(myid == kproc(4)) then
    d = 4.0_dp ! Complex computation for d
  endif
  if(myid == kproc(5)) then
    e = 5.0_dp ! Complex computation for e
  endif
  call MPI_BCAST(a, n, MPI_DOUBLE_PRECISION, kproc(1), MPI_COMM_WORLD, ierr)
  call MPI_BCAST(b, n, MPI_DOUBLE_PRECISION, kproc(2), MPI_COMM_WORLD, ierr)
  call MPI_BCAST(c, n, MPI_DOUBLE_PRECISION, kproc(3), MPI_COMM_WORLD, ierr)
  call MPI_BCAST(d, n, MPI_DOUBLE_PRECISION, kproc(4), MPI_COMM_WORLD, ierr)
  call MPI_BCAST(e, n, MPI_DOUBLE_PRECISION, kproc(5), MPI_COMM_WORLD, ierr)
  d = d + e

  call MPI_FINALIZE(ierr)
end program test
In this example, you can see that the computation of the five arrays a, b, c, d and e is split between the MPI processes. Notice also that d and e are in fact two parts of the same array: in the end, what matters is only the value of d = d+e.
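For reference, here is roughly the non-blocking variant I tested. It only replaces the five MPI_BCAST calls above, and it assumes an MPI-3 implementation that provides MPI_IBCAST and MPI_WAITALL:

  integer, dimension(5) :: requests

  ! Start all five broadcasts without waiting for each one to complete
  call MPI_IBCAST(a, n, MPI_DOUBLE_PRECISION, kproc(1), MPI_COMM_WORLD, requests(1), ierr)
  call MPI_IBCAST(b, n, MPI_DOUBLE_PRECISION, kproc(2), MPI_COMM_WORLD, requests(2), ierr)
  call MPI_IBCAST(c, n, MPI_DOUBLE_PRECISION, kproc(3), MPI_COMM_WORLD, requests(3), ierr)
  call MPI_IBCAST(d, n, MPI_DOUBLE_PRECISION, kproc(4), MPI_COMM_WORLD, requests(4), ierr)
  call MPI_IBCAST(e, n, MPI_DOUBLE_PRECISION, kproc(5), MPI_COMM_WORLD, requests(5), ierr)
  ! Wait for all broadcasts to finish before using the arrays
  call MPI_WAITALL(5, requests, MPI_STATUSES_IGNORE, ierr)

This lets the five broadcasts overlap instead of running one after the other, which is where the improvement I measured seems to come from. The problem remains that I cannot rely on MPI_IBCAST being available everywhere.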