
Background

I am relying on OpenMP parallelization and pseudo-random number generation in my program, but at the same time I would like the results to be perfectly replicable if desired (provided the program runs with the same number of threads).

I'm seeding a `thread_local` PRNG for each thread separately like this:

{
  std::minstd_rand master{};
  #pragma omp parallel for ordered
  for(int j = 0; j < omp_get_num_threads(); j++)
    #pragma omp ordered
    global::tl_rng.seed(master());
}

and I've come up with the following way of producing `count` elements in parallel and collecting them all into an array at the end in a deterministic order (results of thread 0 first, those of thread 1 next, etc.):

std::vector<Element> all{};
...
#pragma omp parallel if(parallel)
{
  std::vector<Element> tmp{};
  tmp.reserve(count/omp_get_num_threads() + 1);

  // generation loop
  #pragma omp for
  for(size_t j = 0; j < count; j++)
    tmp.push_back(generateElement(global::tl_rng));

  // collection loop
  #pragma omp for ordered
  for(int j = 0; j < omp_get_num_threads(); j++)
    #pragma omp ordered
    all.insert(all.end(),
        std::make_move_iterator(tmp.begin()),
        std::make_move_iterator(tmp.end()));
}

The question

This seems to work but I'm not sure if it's reliable (read: portable). Specifically, if, for example, the second thread is done with its share of the main loop early because its generateElement() calls happened to return quickly, won't it technically be allowed to pick the first iteration of the collecting loop? With my compiler that does not happen and it's always thread 0 doing j = 0, thread 1 doing j = 1 etc., as intended. Does that follow from the standard or is it allowed to be compiler-specific behaviour?

I could not find much about the ordered clause on the for directive except that it is required if the loop contains an ordered directive inside. Does it always guarantee that the threads will split the loop from the start in increasing thread_num? Where does it say so in citable sources? Or do I have to make my "generation" loop ordered as well – does it actually make a difference (performance- or logic-wise) when there's no ordered directive in it?

Please don't answer by experience, or by how OpenMP would logically be implemented. I'd like to be backed by the standard.

The Vee

1 Answer


No, the code in its current state is not portable. It will work only if the default loop schedule is static, that is, the iteration space is divided into contiguous chunks of roughly count / #threads iterations each, which are then assigned to the threads in the order of their thread ID, with a guaranteed mapping between chunk and thread ID. But the OpenMP specification does not prescribe any default schedule and leaves it to the implementation to pick one. Many implementations use static, but that is not guaranteed to always be the case.

If you add schedule(static) to all loop constructs, then the combination of ordered clause and ordered construct within each loop body will ensure that thread 0 will receive the first chunk of iterations and will also be the first one to execute the ordered construct. For the loops that run over the number of threads, the chunk size will be one, i.e. each thread will execute exactly one iteration and the order of the iterations of the parallel loop will match those of a sequential loop. The 1:1 mapping of iteration number to thread ID done by the static schedule will then ensure the behaviour you are aiming for.

Note that if the first loop, where you initialise the thread-local PRNGs, is in a different parallel region, you must ensure that both parallel regions execute with the same number of threads, e.g., by disabling dynamic team sizing (omp_set_dynamic(0);) or by explicit application of the num_threads clause.
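A small sketch of that precaution, assuming the desired team size n is available; team_sizes is an illustrative name that simply reports the team size observed in two consecutive parallel regions:

```cpp
#include <utility>

#ifdef _OPENMP
#include <omp.h>
#else
// serial fallback so the sketch also builds without OpenMP
static void omp_set_dynamic(int) {}
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical sketch: pin two separate parallel regions to the
// same team size and return the sizes actually observed.
std::pair<int, int> team_sizes(int n) {
    omp_set_dynamic(0);  // disable dynamic adjustment of team sizes
    int a = 0, b = 0;

    #pragma omp parallel num_threads(n)
    {
        #pragma omp single
        a = omp_get_num_threads();  // e.g. the PRNG-seeding region
    }

    #pragma omp parallel num_threads(n)
    {
        #pragma omp single
        b = omp_get_num_threads();  // e.g. the generation/collection region
    }
    return {a, b};
}
```

With dynamic adjustment disabled and the same num_threads clause on both regions, both regions execute with the same number of threads, so the thread-local PRNG seeded in the first region belongs to the thread that uses it in the second.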

As to the significance of the ordered clause + construct, it does not influence the assignment of iterations to threads, but it synchronises the threads and makes sure that the physical execution order will match the logical one. A statically scheduled loop without an ordered clause will still assign iteration 0 to thread 0, but there will be no guarantee that some other thread won't execute its loop body ahead of thread 0. Also, any code in the loop body outside of the ordered construct is still allowed to execute concurrently and out of order - see here for a more detailed explanation.
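This can be made concrete with a small sketch (ordered_trace is an illustrative name): only the statement inside the ordered construct is serialised in logical iteration order, while the rest of the loop body is free to run concurrently:

```cpp
#include <vector>

// Hypothetical sketch: record values in logical iteration order
// from a statically scheduled ordered loop.
std::vector<int> ordered_trace(int n) {
    std::vector<int> trace;
    #pragma omp parallel
    {
        #pragma omp for schedule(static) ordered
        for (int j = 0; j < n; j++) {
            int local = j * j;  // may execute out of order across threads

            #pragma omp ordered
            trace.push_back(local);  // serialised; logical order guaranteed
        }
    }
    return trace;
}
```

Regardless of how the threads interleave the computation of `local`, the pushes into `trace` happen strictly in the order j = 0, 1, 2, …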

Hristo Iliev
  • Perfect. I upvoted the other answer you reference here earlier today, actually. I was just unsure about the following, which seems to be implied there and confirmed further by what you wrote here: so the meaning of the `ordered` clause for the enclosing `for` directive is nothing but to *mark* that an `ordered` directive *may* appear inside it, and makes no difference to the loop whatsoever if the latter does not happen? – The Vee Nov 16 '16 at 16:00
  • Exactly. The clause marks the loop execution as ordered, since supporting ordered loops usually requires a different runtime function. Which part of the loop body should be synchronised is declared using the construct. And indeed, since the `for` construct might be orphaned in, e.g., some external library code, the compiler cannot reliably determine whether a loop is ordered or not by just looking for the presence of the `ordered` construct in the body, even if the full call graph is examined. – Hristo Iliev Nov 16 '16 at 16:17
  • Thank you! I thought that may be the case but was confused by G++ requiring that the ordered construct "be closely nested inside a loop region with an ‘ordered’ clause", which I would interpret as precluding being in separate functions or even farther (if that's what was meant by orphaning). – The Vee Nov 16 '16 at 16:20
  • Hm, that's true and it's what the specification demands. Then it was probably to ease the parser in the earlier OpenMP versions. – Hristo Iliev Nov 16 '16 at 16:33