OpenMP 4.5+ provides the capability to do vector/array reductions in C++ (press release)

Using said capability allows us to write, e.g.:

#include <vector>
#include <iostream>

int main(){
  std::vector<int> vec;

  #pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

  #pragma omp parallel for default(none) schedule(static) reduction(merge: vec)
  for(int i=0;i<100;i++)
    vec.push_back(i);

  for(const auto x: vec)
    std::cout<<x<<"\n";

  return 0;
}

The problem is that, when this code runs, the per-thread results may be combined in any order.

Is there a way to enforce an order such that thread 0's results precede thread 1's, and so on?

Richard
  • The `for` clause takes an optional `ordered`, see [OpenMP 4.5 API C/C++ Syntax Reference Guide](http://www.openmp.org/wp-content/uploads/OpenMP-4.5-1115-CPP-web.pdf). You should be able to combine this with `reduction`. – Henri Menke Jun 14 '17 at 08:13

1 Answer

The order in which a reduction combines values is explicitly unspecified ("The location in the OpenMP program at which the values are combined and the order in which the values are combined are unspecified.", 2.15.3.6 in OpenMP 4.5). Therefore a reduction cannot give you a guaranteed order.

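One way would be to use ordered, which needs both the ordered clause on the loop directive and an ordered region inside the loop body:

std::vector<int> vec;
#pragma omp parallel for ordered default(none) schedule(static) shared(vec)
for(int i=0;i<100;i++) {
    // do some computations here (these still run in parallel)
    #pragma omp ordered
    vec.push_back(i);  // executed in loop-iteration order, one thread at a time
}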

Note that vec is now shared, and that ordered implies serialized execution of the ordered region and synchronization among threads. This can be very bad for performance unless each of your computations requires a significant and uniform amount of time.
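To make the trade-off concrete, here is a minimal sketch of the ordered approach wrapped in a function. The names `expensive` and `squares_in_order` are hypothetical stand-ins for your own computation, and the chunk size of 1 is an assumption chosen so that neighbouring iterations land on different threads; it is not something the code above requires.

```cpp
#include <vector>

// Hypothetical stand-in for the per-iteration work.
int expensive(int i) { return i * i; }

// Computes expensive(i) for i in [0, n) and stores the results in
// iteration order. Built without OpenMP, the pragmas are ignored and
// the loop simply runs serially with the same result.
std::vector<int> squares_in_order(int n) {
    std::vector<int> vec;
    #pragma omp parallel for ordered schedule(static, 1) default(none) shared(vec, n)
    for (int i = 0; i < n; i++) {
        const int value = expensive(i);  // may run concurrently across threads
        #pragma omp ordered
        vec.push_back(value);            // runs one iteration at a time, in order
    }
    return vec;
}
```

Only the push_back sits inside the ordered region, so the computation itself can overlap between threads while the appends stay sequential.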

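Alternatively, you can write a custom ordered reduction: split the parallel region from the for loop, let each thread collect its results in a private vector, and then append the local results to the global vector one thread at a time, in thread order.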

#include <omp.h>
#include <vector>

std::vector<int> global_vec;
#pragma omp parallel
{
    std::vector<int> local_vec;  // private to each thread
    #pragma omp for schedule(static)
    for (int i=0; i < 100; i++) {
        // some computations
        local_vec.push_back(i);
    }
    // Append the local results one thread at a time, in thread order.
    for (int t = 0; t < omp_get_num_threads(); t++) {
        #pragma omp barrier
        if (t == omp_get_thread_num()) {
            global_vec.insert(global_vec.end(), local_vec.begin(), local_vec.end());
        }
    }
}
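As a possible variation on the barrier loop, the merge step can itself be written as an ordered loop over thread indices. This is only a sketch: the function name `collect_in_thread_order` is hypothetical, and it relies on the same assumption as the code above, namely that schedule(static) hands the lowest iterations to thread 0, the next block to thread 1, and so on. The merge loop uses schedule(static, 1) so that iteration t is assigned to thread t.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Fallback so the sketch still builds (and runs serially) without OpenMP.
static int omp_get_num_threads() { return 1; }
#endif

// Collects 0..n-1 into one vector, with each thread's block of results
// appended in thread order.
std::vector<int> collect_in_thread_order(int n) {
    std::vector<int> global_vec;
    #pragma omp parallel
    {
        std::vector<int> local_vec;  // private to each thread
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            // some computations
            local_vec.push_back(i);
        }
        const int nthreads = omp_get_num_threads();
        // One iteration per thread; the ordered region runs the appends
        // sequentially in iteration order, i.e. in thread order.
        #pragma omp for ordered schedule(static, 1)
        for (int t = 0; t < nthreads; t++) {
            #pragma omp ordered
            global_vec.insert(global_vec.end(), local_vec.begin(), local_vec.end());
        }
    }
    return global_vec;
}
```

Calling `collect_in_thread_order(100)` should give the values 0 to 99 in ascending order under that scheduling assumption. Whether this waits less than the barrier version would need to be measured.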
Zulan
  • Great answer! In your latter approach, you may [do as in this answer](https://stackoverflow.com/a/27206559/5861244). I am not sure which is preferable but I would guess that there is more waiting with the `barrier` approach. – Benjamin Christoffersen Oct 06 '20 at 12:20