
I am new to MPI programming. I have to test three codes: a sequential version, an OpenMP version and an MPI version. These three codes (not the real codes, just examples) are given below.

The sequential code

 program no_parallel
 implicit none
 integer, parameter                         :: dp = selected_real_kind(15,307)  
 integer                                    :: i, j
 real(kind = dp)                            :: time1, time2
 real(kind = dp), dimension(1000000)        :: a
    !Initialisation
        do i = 1, 1000000
           a(i) = sqrt( dble(i) / 3.0d+0 ) 
        end do  
    call cpu_time( time1 )
        do j = 1, 1000 
             do i = 1, 1000000
                a(i) = a(i) + sqrt( dble(i) ) 
             end do 
        end do
    call cpu_time( time2 )
    print *, a(1000000)
    print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
 end program no_parallel

The OpenMP code

 program openmp
 implicit none
 integer, parameter                         :: dp = selected_real_kind(15,307)  
 integer                                    :: i, j
 real(kind = dp)                            :: time1, time2, omp_get_wtime
 real(kind = dp), dimension(1000000)        :: a
    !Initialisation
        do i = 1, 1000000
           a(i) = sqrt( dble(i) / 3.0d+0 ) 
        end do 
    time1 = omp_get_wtime()
     !$omp parallel
        do j = 1, 1000
          !$omp do schedule( runtime ) 
             do i = 1, 1000000
                a(i) = a(i) + sqrt( dble(i) ) 
             end do
          !$omp end do 
        end do
     !$omp end parallel 
    time2 = omp_get_wtime()
    print *, a(1000000)
    print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
 end program openmp

The MPI code

 program MPI
 implicit none
 include "mpif.h"
 integer, parameter                         :: dp = selected_real_kind(15,307)
 integer                                    :: ierr, num_procs, my_id, destination, tag, source, i, j
 integer, dimension(MPI_STATUS_SIZE)        :: stat   ! the receive status must be an integer array of size MPI_STATUS_SIZE
 real(kind = dp)                            :: time1, time2
 real(kind = dp), dimension(1000000)        :: a
    call MPI_INIT ( ierr )
    call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
    call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )  
    !Initialisation
        do i = 1, 1000000
           a(i) = sqrt( dble(i) / 3.0d+0 ) 
        end do
    destination = 0
    tag = 999
    source = 3
    time1 = MPI_Wtime()
        do j = 1, 1000     
           do i = 1 + my_id, 1000000, num_procs
              a(i) = a(i) + sqrt( dble(i) ) 
           end do
        end do
    call MPI_BARRIER ( MPI_COMM_WORLD, ierr )
    if( my_id == source ) then 
        call MPI_SEND ( a(1000000), 1, MPI_DOUBLE_PRECISION, destination, tag, MPI_COMM_WORLD, ierr )
    end if
    if( my_id == destination ) then
        call MPI_RECV ( a(1000000), 1, MPI_DOUBLE_PRECISION, source, tag, MPI_COMM_WORLD, stat, ierr )
    end if
    time2 = MPI_Wtime()
    if( my_id == 0) then
        print *, a(1000000)    !, 'from ID =', my_id 
        print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
    end if
    stop 
    call MPI_FINALIZE ( ierr )
 end program MPI

I compiled these codes using Intel Fortran Compiler 17.0.3 with the -O0 optimisation flag. Both the OpenMP and MPI codes were run on a 4-core Haswell desktop. The measured times for the sequential, OpenMP and MPI codes were 8.08 s, 2.1 s and 3.2 s respectively. I was expecting the OpenMP and MPI results to be almost the same, but they weren't. My questions:

  1. Regarding the MPI code: if I want to print out the value of a(1000000), is there a smarter way to do it than calling MPI_SEND and MPI_RECV?

  2. Do you have any idea which part of the MPI code can still be optimised?

  3. With regard to source in the MPI code, is it possible to determine it automatically? In this case it is easy for me, since there are 4 processes, so a(1000000) must reside on rank 3.

Thank you in advance.

bestrong
    Performance comparisons of unoptimized (`-O0`) codes are useless and have no practical relevance. It is not worth discussing. You also want to use a wall clock, instead of CPU time (https://stackoverflow.com/a/6880133/620382). – Zulan May 30 '17 at 16:06
  • @Zulan: Thanks, but you didn't help at all :) I posted this of course with a reason. I have compiled the MPI code with -O3, but the CPU time remained unchanged (still 3.2s). Since I am also new to MPI programming, I would like to know the basics without using the optimisation flag. Regarding the CPU time in your answer, that CPU time works only in the sequential code, so I do not actually need it. I would like to know if one can still optimise the MPI code compared to the OpenMP one. Anyway thanks for your answer. – bestrong May 30 '17 at 17:16
  • Your sequential code is using CPU time (we assume, from the function you call :-)), but your parallel codes are using elapsed, Wall-clock time. (Which is why the functions have a 'W' in the name...). So you can't sensibly compare them. – Jim Cownie May 31 '17 at 08:25
  • @JimCownie He should have used system_clock, but in practice the result will be almost the same. – Vladimir F Героям слава May 31 '17 at 10:04
  • @JimCownie: Thanks for the answer; but again my questions haven't been answered yet :D :D There are 3 questions written in my post and none of them is about the sequential code :) I posted the sequential code just to give readers a reference. Also, I'm only interested in comparing the OpenMP and MPI codes (as written in my questions). If you say that I can't sensibly compare them, the CPU times of the sequential and OpenMP codes are 8.08s and 2.1s respectively, which are totally different :) :) Anyway, thanks. – bestrong May 31 '17 at 10:32
  • @VladimirF: I do agree with you. I am just confused now why some people commented on those CPU times, rather than answering my questions :D – bestrong May 31 '17 at 10:33
  • @bob.bob.bob Because it cannot be reasonably answered without speculation. And people don't want to do that. That's why they told you your question is not good. And I think the same BTW. I upvoted the first comment under the question. There is nothing interesting to be answered here. – Vladimir F Героям слава May 31 '17 at 11:01
  • @bob.bob.bob The point was that for SIMD reduction, the example may work better than your 2 second example on one core. Some people choose to optimise the single core performance first - before then working it across multiple cores. Whether it helps or not I do not know, but I am sensing 'not'. – Holmz Jun 01 '17 at 09:12

3 Answers


Finally, I found the solution to my problem. Previously, I didn't realise that parallelising the do loop of the serial code:

do i = 1, 1000000
   a(i) = a(i) + sqrt( dble(i) ) 
end do

as a cyclic distribution in the MPI code:

do i = 1 + my_id, 1000000, num_procs
   a(i) = a(i) + sqrt( dble(i) ) 
end do

was the problem. I assume this is because more cache misses occur. Therefore, instead of a cyclic distribution, I now apply a block distribution in the MPI code, which is more efficient (for this case!). The revised MPI code is:

 program Revised_MPI
 use mpi
 implicit none
 integer, parameter                            :: dp = selected_real_kind(15,307), array_size = 1000000
 integer                                       :: ierr, num_procs, my_id, ista, iend, i, j
 integer, dimension(:), allocatable            :: ista_idx, iend_idx
 real(kind = dp)                               :: time1, time2
 real(kind = dp), dimension(:), allocatable    :: a

    call MPI_INIT ( ierr )
    call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
    call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )

    !Distribute loop with block distribution
    call para_range ( 1, array_size, num_procs, my_id, ista, iend )
    allocate ( a( ista : iend ), ista_idx( num_procs ), iend_idx( num_procs ) )

    !Initialisation and saving ista and iend
        do i = ista, iend
           a(i) = sqrt( dble(i) / 3.0d+0 )
           ista_idx( my_id + 1 ) = ista
           iend_idx( my_id + 1 ) = iend  
        end do

    time1 = MPI_Wtime()

    !Performing main calculation for all processors (including master and slaves)
    do j = 1, 1000     
       do i = ista_idx( my_id + 1 ), iend_idx( my_id + 1 )
          a(i) = a(i) + sqrt( dble(i) ) 
       end do
    end do
    call MPI_BARRIER ( MPI_COMM_WORLD, ierr )

    time2 = MPI_Wtime()

    if( my_id == num_procs - 1 ) then
        print *, a( array_size )
        print *, 'Elapsed real time = ', time2 - time1, 'second(s)' 
    end if

    call MPI_FINALIZE ( ierr )
    deallocate ( a )
 end program Revised_MPI

!-----------------------------------------------------------------------------------------
 subroutine para_range ( n1, n2, num_procs, my_id, ista, iend )

 implicit none

 integer                                       :: n1, n2, num_procs, my_id, ista, iend, &
                                                  iwork1, iwork2

    iwork1 = ( n2 - n1 + 1 ) / num_procs
    iwork2 = mod( n2 - n1 + 1, num_procs )
    ista = my_id * iwork1 + n1 + min( my_id, iwork2 )
    iend = ista + iwork1 - 1
    if( iwork2 > my_id ) then
        iend = iend + 1
    end if

 end subroutine para_range
!-----------------------------------------------------------------------------------------   

Now the MPI code achieves an (almost) identical elapsed time to the OpenMP one. It also works fine with the -O3 and -fast optimisation flags.

Thank you ALL for your help. :)

bestrong

Actually, your MPI program does not make much sense to me. Why do all ranks have the same full array? Why do you want to copy the full array? Why just between this particular source and destination?

The program does not compute anything useful, so it is really hard to tell what the correct program would be (one that correctly does not compute anything useful).

In many MPI programs you never send and receive whole arrays, not even the complete local arrays; typically only some boundary values are exchanged between neighbouring ranks.
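
As an illustration only (not taken from the original post), a typical boundary ("halo") exchange might look like the sketch below. It assumes each rank stores its chunk as a(0:my_n+1), where a(0) and a(my_n+1) are hypothetical ghost cells, and left/right are just illustrative variable names:

 integer :: left, right   ! neighbouring ranks (hypothetical names)

 left  = merge( my_id - 1, MPI_PROC_NULL, my_id > 0 )
 right = merge( my_id + 1, MPI_PROC_NULL, my_id < num_procs - 1 )

 ! send my last element to the right neighbour, receive the left neighbour's last element into a(0)
 call MPI_SENDRECV( a(my_n), 1, MPI_DOUBLE_PRECISION, right, 0, &
                    a(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr )

 ! send my first element to the left neighbour, receive the right neighbour's first element into a(my_n+1)
 call MPI_SENDRECV( a(1),       1, MPI_DOUBLE_PRECISION, left,  1, &
                    a(my_n+1),  1, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr )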

So I came up with the program below. Notice the `use mpi` and that I removed the magic number 1000000 from everywhere.

I also removed the `stop`. A `stop` right before `end` is just a bad habit and not harmful by itself, but putting it before `MPI_Finalize()` is actively harmful.

And most importantly, I distributed the work differently. Every rank has its own part of the array it works on.

program Test_MPI

use mpi

 implicit none

 integer, parameter                         :: dp = selected_real_kind(15,307)
 integer                                    :: ierr, num_procs, my_id, stat, i, j
 real(kind = dp)                            :: time1, time2
 real(kind = dp), dimension(:), allocatable :: a
 integer, parameter                         :: n = 1000000
 integer                                    :: my_n, ns
    call MPI_INIT ( ierr )
    call MPI_COMM_RANK ( MPI_COMM_WORLD, my_id, ierr )
    call MPI_COMM_SIZE ( MPI_COMM_WORLD, num_procs, ierr )  

    my_n = n / num_procs
    ns = my_id * my_n
    if (my_id == num_procs-1) my_n = n - ns

    allocate(a(my_n))

    !Initialisation
        do i = 1, my_n
           a(i) = sqrt( real(i+ns, dp) / 3.0d+0 ) 
        end do

    stat = MPI_STATUS_SIZE
    time1 = MPI_Wtime()
        do j = 1, 1000     
           do i = 1 , my_n 
              a(i+my_id) = a(i) + sqrt( real(i+ns, dp) ) 
           end do
        end do
    call MPI_BARRIER ( MPI_COMM_WORLD, ierr )

    time2 = MPI_Wtime()
    if( my_id == 0) then
        !!!! why???  print *, a(my_n) 
        print *, 'Elapsed real time = ', time2 - time1, 'second(s)'
    end if

    call MPI_FINALIZE ( ierr )
 end program Test_MPI

Yes, there is no communication in there. I can't think of why it should be there. If it should, you have to tell us why. It should scale almost perfectly.

Maybe you want to gather the final array in one rank? Many people do that, but often it is not needed at all. It is not clear why it would be needed in your case.
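
If you really did need the whole array on one rank, a collective gather is the usual way rather than point-to-point messages. A rough sketch only (untested): it reuses n, my_n, my_id, num_procs, a and ierr from the program above, and counts, displs and a_full are hypothetical helper variables whose declarations would go with the others:

 integer, dimension(:), allocatable            :: counts, displs
 real(kind = dp), dimension(:), allocatable    :: a_full

 allocate( counts(num_procs), displs(num_procs) )

 ! every rank contributes its local length, so everyone knows how much each rank holds
 call MPI_ALLGATHER( my_n, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr )
 displs(1) = 0
 do i = 2, num_procs
    displs(i) = displs(i-1) + counts(i-1)
 end do

 ! the receive buffer only matters on the root
 if( my_id == 0 ) then
     allocate( a_full(n) )
 else
     allocate( a_full(1) )
 end if

 call MPI_GATHERV( a, my_n, MPI_DOUBLE_PRECISION,                &
                   a_full, counts, displs, MPI_DOUBLE_PRECISION, &
                   0, MPI_COMM_WORLD, ierr )

 if( my_id == 0 ) print *, a_full(n)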

  • Shouldn't the bounds for the inner main loop be `do i = 1, my_n`? As it stands there is a bounds error for `my_id>0`. I don't think this is reproducing the original anyway. There should be some offsets calculated and applied as either `a(i) = sqrt( real(i+local_offset, dp) / 3.0d+0 )` or changes made to the array bounds. – RussF Jun 01 '17 at 01:36
  • @RussF That was just a quick change in the last edit and I put the factor in the wrong line. Otherwise the code **was tested** with debugging options on. I don't actually think one may provide a fully debugged code as an answer to this type of question... – Vladimir F Героям слава Jun 01 '17 at 05:48

I find one generally needs a lot more work in the SUBROUTINE or FUNCTION to make parallelism pay off, so focusing on vectorising is your best approach in this example. The moniker is "Vectorise inner, parallelise outer" (VIPO). For the second case I would suggest the following:

MODULE MyOMP_Funcs
IMPLICIT NONE
PRIVATE

integer, parameter, PUBLIC           :: dp = selected_real_kind(15,307)  
real(kind = dp), dimension(1000000)  :: a

PUBLIC  MyOMP_Init, MyOMP_Sum

CONTAINS

!=================================
SUBROUTINE MyOMP_Init(N,A)
IMPLICIT NONE

integer                      , INTENT(IN   ) :: N  
real(kind = dp), dimension(n), INTENT(INOUT) :: A

integer                                      :: I 

!Initialisation
DO i = 1, n
  A(i) = sqrt( dble(i) / 3.0d+0 ) 
ENDDO 

RETURN
END SUBROUTINE MyOMP_Init


!=================================
SUBROUTINE MyOMP_Sum(N,A,SumA)
!$OMP DECLARE SIMD(MyOMP_Sum) UNIFORM(N,SumA) linear(ref(A))
USE OMP_LIB
IMPLICIT NONE

integer                      , INTENT(IN   ) :: N
!DIR$ ASSUME_ALIGNED A: 64
real(kind = dp), dimension(n), INTENT(IN   ) :: A
real(kind = dp)              , INTENT(  OUT) :: SumA

integer                                      :: I

SumA = 0.0

!Maybe also try...  !DIR$ VECTOR ALWAYS

!$OMP SIMD REDUCTION(+:SumA)
Sum_Loop: DO i = 1, N
  SumA = SumA + A(i) + sqrt( dble(i) ) 
ENDDO Sum_Loop
!$OMP END SIMD   !<-- optional; you probably do not need this

RETURN
END SUBROUTINE MyOMP_Sum

!=================================
SUBROUTINE My_NOVEC_Sum(N,A,SumA)
IMPLICIT NONE

integer                      , INTENT(IN   ) :: N
!DIR$ ASSUME_ALIGNED A: 64
real(kind = dp), dimension(n), INTENT(IN   ) :: A
real(kind = dp)              , INTENT(  OUT) :: SumA

integer                                      :: I

SumA = 0.0

!DIR$ NOVECTOR
Sum_Loop: DO i = 1, N
  SumA = SumA + A(i) + sqrt( dble(i) ) 
ENDDO Sum_Loop

RETURN
END SUBROUTINE My_NOVEC_Sum

!=================================
END MODULE MyOMP_Funcs
!=================================


!=================================
program openmp
!USE OMP_LIB 
USE MyOMP_Funcs 
implicit none

integer        , PARAMETER          :: OneM = 1000000
integer        , PARAMETER          :: OneK = 1000
integer                             :: i, j
real(kind = dp)                     :: time1, time2, omp_get_wtime
!DIR$ ATTRIBUTES ALIGN: 64          :: A, SumA
real(kind = dp), dimension(OneM)    :: A
real(kind = dp)                     :: SumA
!Initialisation

CALL MyOMP_Init(OneM, A)
time1 = omp_get_wtime()

!  !$omp parallel
!    do j = 1, OneK

CALL MyOMP_Sum(OneM, A, SumA)

!    end do
!  !$omp end parallel 
!!--> Put timing loops here

time2 = omp_get_wtime()
print *, a(1000000)
print *, 'Elapsed real time = ', time2 - time1, 'second(s)'

end program openmp

Once you have a SIMD REDUCTION version running, you can try layering parallelism on top (a rough sketch follows below).
If the module is part of a library then the compiler settings are independent of the program.
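
For example, combining a parallel region with the SIMD reduction might look like this sketch (untested, and only an assumption about how one could merge the constructs; it reuses the N, A and SumA arguments of MyOMP_Sum):

! Sketch only: combined parallel worksharing + SIMD reduction over the same loop.
SumA = 0.0_dp
!$OMP PARALLEL DO SIMD REDUCTION(+:SumA)
DO i = 1, N
  SumA = SumA + A(i) + sqrt( dble(i) )
ENDDO
!$OMP END PARALLEL DO SIMD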

Holmz
  • So the answer to the question why there is a difference between OpenMP and MPI run time is ...? – Vladimir F Героям слава May 31 '17 at 10:05
  • @Holmz: What is your point in posting this? As I wrote, I use the -O0 optimisation flag. Of course, I can optimise this with SIMD, -O3, -fast, etc. My question is, why at the basic level of performance (without optimisation, SIMD or anything else) do the OpenMP and MPI codes above differ quite a lot? Is it because I wrote the MPI code inefficiently? That's the point of my question. Anyway thanks :) – bestrong May 31 '17 at 10:40