#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
  /* ... code happens nicely in parallel here ... */
  #pragma omp ordered
  {
    /* ... one at a time in order of i, as expected, good ... */
  }
  /* ... single threaded here but I expected parallel ... */
}

I expected the next thread to enter the ordered section as soon as the current thread left it. But the next thread only enters the ordered section when the current thread reaches the end of the loop body. So the code after the ordered section also runs serially.

The OpenMP 4.0 manual contains:

The ordered construct specifies a structured block in a loop region that will be executed in the order of the loop iterations. This sequentializes and orders the code within an ordered region while allowing code **outside** the region to run in parallel.

Where I've added the bold. I'm reading "outside" to include after the ordered section ends.

Is this expected? Must the ordered section in fact be last?

I've searched for an answer and found one other place where someone observed something similar, nearly 2 years ago: https://stackoverflow.com/a/32078625/403310:

Testing with gfortran 5.2, it appears everything after the ordered region is executed in order for each loop iteration, so having the ordered block at the beginning of the loop leads to serial performance while having the ordered block at the end of the loop does not have this implication as the code before the block is parallelized. Testing with ifort 15 is not as dramatic but I would still recommend structuring your code so your ordered block occurs after any code that needs parallelization in a loop iteration rather than before.

I'm using gcc 5.4.0 on Ubuntu 16.04.

Many thanks.

Matt Dowle

1 Answer


There is no need for the ordered region to be last. The behavior you observe is implementation-dependent and is a known flaw in libgomp (the OpenMP runtime library from gcc). I suppose this behavior is tolerated by the standard, though it is clearly not optimal.

Technically, the compiler produces the following code from the annotations:

#pragma omp parallel for ordered
for (int i = 0; i < n; ++i) {
  /* ... code happens nicely in parallel here ... */
  GOMP_ordered_start();
  {
    /* ... one at a time in order of i, as expected, good ... */
  }
  GOMP_ordered_end();
  /* ... single threaded here but I expected parallel ... */
  GOMP_loop_ordered_static_next();
}

Unfortunately, GOMP_ordered_end is implemented as follows:

/* This function is called by user code when encountering the end of an
   ORDERED block.  With the current ORDERED implementation there's nothing
   for us to do.

   However, the current implementation has a flaw in that it does not allow
   the next thread into the ORDERED section immediately after the current
   thread exits the ORDERED section in its last iteration.  The existance
   of this function allows the implementation to change.  */

void
GOMP_ordered_end (void)
{
}

I speculate that there just never was a significant use case for this, given that ordered is probably most commonly used in the sense of:

#pragma omp parallel for ordered
for (...) {
    result = expensive_computation();
    #pragma omp ordered
    {
        append(results, result);
    }
}

The OpenMP runtime from the Intel compiler does not suffer from this flaw.

Zulan
  • Thank you! Exactly what I was looking for, and more. – Matt Dowle Apr 21 '17 at 19:21
  • Note that the LLVM runtime (http://openmp.llvm.org) is effectively identical to the Intel runtime, and both can be used as a dynamic load-time replacement for libgomp (unless you are using some task-related features for which LLVM's libomp hasn't yet got the gcc shims). (FWIW I work for Intel on the OpenMP runtime :-)) – Jim Cownie Apr 24 '17 at 09:30