
I am writing a Fortran program that needs to have reproducible results (for publication). My understanding of the following code is that it should be reproducible.

program main
implicit none
real(8) :: ybest,xbest,x,y
integer :: i

ybest = huge(0d0)
!$omp parallel do ordered private(x,y) shared(ybest,xbest) schedule(static,1)
do i = 1,10
    !$omp ordered
    !$omp critical
    call random_number(x)
    !$omp end critical
    !$omp end ordered

    ! Do a lot of work
    call sleep(1)
    y = -1d0

    !$omp ordered
    !$omp critical
    if (y<ybest) then
        ybest = y
        xbest = x
    end if
    !$omp end critical
    !$omp end ordered
end do
!$omp end parallel do

end program

In my case, there is a function in place of "sleep" that takes a long time to compute, and I want it done in parallel. According to the OpenMP standard, should sleep in this example execute in parallel? I thought it should (based on this question: How does the omp ordered clause work?), but with gfortran 5.2.0 (mac) and gfortran 5.1.0 (linux) it is not executing in parallel (at least, there is no speedup from it). The timing results are below.

Also, my guess is the critical statements are not necessary, but I wasn't completely sure.

Thanks.

-Edit-

In response to Vladimir's comments, I added a full working program with timing results.

#!/bin/bash
mpif90 main.f90
time ./a.out
mpif90 main.f90 -fopenmp
time ./a.out

The code runs as

real    0m10.047s
user    0m0.003s
sys 0m0.003s

real    0m10.037s
user    0m0.003s
sys 0m0.004s

BUT, if you comment out the ordered blocks, it runs with the following times:

real    0m10.044s
user    0m0.002s
sys 0m0.003s

real    0m3.021s
user    0m0.002s
sys 0m0.004s
-Edit-

In response to innoSPG, here are the results for a non-trivial function in place of sleep:

real(8) function f(x)
    implicit none
    real(8), intent(in) :: x
    ! local
    real(8) :: tmp
    integer :: i
    tmp = 0d0
    do i = 1,10000000
        tmp = tmp + cos(sin(x))/real(i,8)
    end do
    f = tmp
end function


real    0m2.229s --- no openmp
real    0m2.251s --- with openmp and ordered
real    0m0.773s --- with openmp but ordered commented out
  • How do you set the number of threads? What is your OMP_NUM_THREADS? How did you determine it is not running in parallel? What exactly do the results of your performance measurement look like? – Vladimir F Героям слава Aug 18 '15 at 15:30
  • The loop is run in parallel, meaning that the iterations are shared among the threads. Each thread runs f on its own data unless you also have $omp in f. This is data parallelism. I guess this is what you mean. Now what is your question? – innoSPG Aug 18 '15 at 15:32
  • Vladimir, I use the environment variable, OMP_NUM_THREADS, equal to 6 for linux and 4 for mac. `top` shows it is running at 100% or sometimes a little over. – gordon Aug 18 '15 at 15:43
  • innoSPG, there is an omp pragma inside f, kind of. I have to declare some module variables private via `!$omp threadprivate`. I am not sure what you mean by "parallelism in data". I want each thread to execute `f` for different values of `x`. – gordon Aug 18 '15 at 15:45
  • Perhaps I should also say that in the code that's not worried about reproducibility (so that the ordered constructs are missing), `top` shows roughly 400% or 600%. – gordon Aug 18 '15 at 15:48
  • Show us your results that tell you that the loop is not running in parallel! Show how exactly do the codes differ. Are you sure you need the 100% reproducibility? What exactly tells you that this ordered is the only correct order? – Vladimir F Героям слава Aug 18 '15 at 15:48
  • By parallelism in data, I mean different threads run the same code on different data – innoSPG Aug 18 '15 at 16:21
  • innoSPG, then yes I want parallelism in data. – gordon Aug 18 '15 at 16:23
  • (you have removed `f` from your code) is `f` parallelizable? if you run `f` once (not in a loop), is there a gain in compiling with openmp versus not compiling with openmp? (assuming of course that you run it with OMP_NUM_THREADS>1) what is the gain parallel versus sequential? – innoSPG Aug 18 '15 at 16:25
  • innoSPG, I provided an example above where `f` is `sleep(1); y = -1d0`, a trivial function analogous to `y=f(x)`. You can see the timing results, but the speedup without the ordered constructs is around 3.33 (10 sec/3 sec). Once the ordered constructs are in place, there is no speed up (10 sec/10 sec) – gordon Aug 18 '15 at 16:27
  • `sleep` is not parallelizable. If you remove the loop there will be no difference between parallel and sequential. the parallel will even be slower from the overhead of the system call. My question is about your actual `f`, which you said also has $omp directives in it. Now, I want to understand before I suggest a path. – innoSPG Aug 18 '15 at 16:32
  • please see the new version where i call a non-trivial function – gordon Aug 18 '15 at 17:08
  • This makes no sense. `cos(sin(x))` does not change in the loop and the compiler will probably optimize it away... Show your real code! – Alexander Vogt Aug 18 '15 at 17:19
  • x is assigned by `call random_number(x)`. – gordon Aug 18 '15 at 17:25

1 Answer


This program is non-conforming to the OpenMP standard. Specifically, the problem is that you have more than one ordered region and every iteration of your loop will execute both of them. The OpenMP 4.0 standard has this to say (2.12.8, Restrictions, line 16, p 139):

During execution of an iteration of a loop or a loop nest within a loop region, a thread must not execute more than one ordered region that binds to the same loop region.

If you have more than one ordered region, you must have conditional code paths such that only one of them can be executed for any loop iteration.
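A minimal sketch of one conforming restructuring (the array `xs` and the serial pre-draw are my own additions, not from the question): generate all the random numbers before the parallel loop, so the loop body contains exactly one ordered region, used only for the reduction.

```fortran
program main
implicit none
real(8) :: ybest,xbest,y
real(8) :: xs(10)
integer :: i

! Draw all random numbers serially, so the sequence is reproducible
! regardless of how iterations are scheduled across threads.
call random_number(xs)

ybest = huge(0d0)
!$omp parallel do ordered private(y) shared(ybest,xbest,xs) schedule(static,1)
do i = 1,10
    ! Do a lot of work (runs in parallel)
    call sleep(1)
    y = -1d0

    ! The single ordered region: the update happens in iteration
    ! order, so the result is deterministic across runs.
    !$omp ordered
    if (y<ybest) then
        ybest = y
        xbest = xs(i)
    end if
    !$omp end ordered
end do
!$omp end parallel do

end program
```

Because the ordered block sits at the end of the loop body, the expensive work before it can still overlap across threads; no critical section is needed, since the ordered region already serializes the update.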


It is also worth noting that the position of your ordered region has performance implications. Testing with gfortran 5.2, it appears that everything after the ordered region is executed in order for each loop iteration, so placing the ordered block at the beginning of the loop leads to serial performance, while placing it at the end of the loop does not, since the code before the block is parallelized. Testing with ifort 15 is not as dramatic, but I would still recommend structuring your code so the ordered block occurs after any code that needs parallelization in a loop iteration, rather than before.

casey
  • Thank you very much! After restructuring the code to only have one ordered section, the time is 1.056 with the single ordered construct and 3.240 without it. – gordon Aug 18 '15 at 17:23