
I need to parallelize this loop. I thought that using OpenMP would be a good idea, but I have never studied it before.

#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
        it!=mesh->NEList[vid].end(); ++it){
    worst_q = std::min(worst_q, mesh->element_quality(*it));
}

In this case the loop is not parallelized because it uses iterators and the compiler cannot understand how to split it.

Can you help me?

Giuseppe Pes
  • eek... I'm not sure if OpenMP has support for min-reductions. – Mysticial Mar 04 '13 at 18:00
  • I agree with @Mysticial. min-reduction is terrible with OpenMP – Andreas Grapentin Mar 04 '13 at 18:03
  • Please re-read your questions after asking them, there was a problem when you copy pasted the code. – J.N. Mar 04 '13 at 18:03
  • "but I never studied them before" premature optimization is the root of many bugs... – danodonovan Mar 04 '13 at 18:03
  • How big do you expect your std::set to be? is iterating over it a large performance issue? more specifically - do you expect the parallelization benefit to outperform the parallelization overhead? OpenMP is generally designed for easy parallelization of low(er)-level constructs, like calculating the sum over a large array, or doing simple task-parallelization. min-reduction, as in reduce the array to the minimum value, is still a major feature hole in OpenMP in general. And, as I said, even if it worked, I'd guess it'd be slower than the sequential version. – Andreas Grapentin Mar 04 '13 at 18:08
  • @danodonovan this rule does not necessarily apply when learning about when and where to use parallelization. I always get the impression that it discourages learners to learn to write efficient code, which is not a good thing. – Andreas Grapentin Mar 04 '13 at 18:10
  • @danodonovan please note that I generally agree with you, I just think it's important to get across the basics of optimizations, even if it is used prematurely as heck in this context. – Andreas Grapentin Mar 04 '13 at 18:12
  • @Andreas Grapentin point taken, but might you not suggest learning to write efficient code first, and then optimizing a much simpler example? – danodonovan Mar 04 '13 at 18:13
  • @danodonovan generally yes, but that wouldn't fit that well in the context of the question :) Also, I gave up on trying to teach upcoming programmers the art of writing efficient code. – Andreas Grapentin Mar 04 '13 at 18:15
  • @Guiseppe, An OMP loop needs to know how big it is and (to my knowledge) using the C++ iterators will not work. In any case, I highly doubt it would work on a set anyway: how does it slice up a set? If the operations to perform outweigh the cost of converting the set to a literal C array then I'd do that. – Levi Morrison Mar 04 '13 at 18:46

2 Answers


OpenMP requires that the controlling predicate in a parallel for loop uses one of the following relational operators: <, <=, > or >=. Only random-access iterators provide these operators, and hence OpenMP parallel loops work only with containers that provide random-access iterators. std::set provides only bidirectional iterators. You can overcome that limitation by using explicit tasks. The reduction can be performed by first reducing partially into variables private to each thread, followed by a global reduction over the partial values.

double *t_worst_q;
// Cache line size on x86/x64, expressed in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);

#pragma omp parallel
{
   #pragma omp single
   {
      t_worst_q = new double[omp_get_num_threads() * cb];
      for (int i = 0; i < omp_get_num_threads(); i++)
         t_worst_q[i * cb] = worst_q;
   }

   // Perform partial min reduction using tasks
   #pragma omp single
   {
      for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
          it!=mesh->NEList[vid].end(); ++it) {
         size_t elem = *it;
         #pragma omp task
         {
            int tid = omp_get_thread_num();
            t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
                                           mesh->element_quality(elem));
         }
      }
   }

   // Perform global reduction
   #pragma omp critical
   {
      int tid = omp_get_thread_num();
      worst_q = std::min(worst_q, t_worst_q[tid * cb]);
   }
}

delete [] t_worst_q;

(I assume that mesh->element_quality() returns double)

Some key points:

  • The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
  • Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
  • The value pointed to by it is dereferenced before the task body. If it were dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration), which is not what you want.
  • Each thread performs a partial reduction in its private part of t_worst_q[].
  • In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so as to end up in separate cache lines. On x86/x64 a cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
  • The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only, since the reduction could also be performed by a loop in the main thread after the parallel region (a minimal sketch of that alternative follows this list).
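
For illustration, here is a minimal sketch of that serial alternative. It assumes the thread count was saved into a variable nthreads (for example via omp_get_num_threads() inside the parallel region) and that the loop runs after the parallel region, before t_worst_q is freed:

// Serial reduction over the per-thread partial minima; nthreads is assumed
// to hold the number of threads used by the parallel region above
for (int i = 0; i < nthreads; i++)
   worst_q = std::min(worst_q, t_worst_q[i * cb]);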

Note that explicit tasks require a compiler that supports OpenMP 3.0 or 3.1. This rules out all versions of the Microsoft C/C++ Compiler (it only supports OpenMP 2.0).
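
If you are unsure what your compiler supports, the _OPENMP macro can be checked at compile time; it expands to the release date of the implemented specification (200805 for OpenMP 3.0, 201107 for 3.1). A minimal sketch:

// Fail the build early if explicit tasks (OpenMP 3.0+) are unavailable
#if !defined(_OPENMP) || _OPENMP < 200805
#error "OpenMP 3.0 or later (explicit tasks) is required"
#endif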

Hristo Iliev
  • This is terribly wrong, you haven't understood tasks. Your so-called "barrier" using "taskwait" only works for the thread that created the tasks. All other threads run through it. The "taskwait" concept helps to account for data dependencies (very useful in recursive functions!), nothing more. The only reason that your program works sometimes is that your first mistake is partially cancelled out by a second mistake: The variable "t_worst_q" within the task references the one from the thread that generated the task and acts as a global variable. But then you have a race condition in the "min". – mastov Jun 05 '15 at 17:50
  • Indeed, the answer was wrong. Wonder how it was able to slip unnoticed for more than two years. – Hristo Iliev Jun 06 '15 at 12:18
  • Still doesn't work. It doesn't compile: You are missing a cast in the allocation of t_worst_q. And it gives wrong results: The indices in the initialization of t_worst_q are wrong. – mastov Jun 06 '15 at 15:18
  • I should be more careful when copy-paste-merging code by hand. I tested it with a vector of integers and later added the strided access to the temporary array. Thanks for fixing it. – Hristo Iliev Jun 06 '15 at 21:49

Random-Access Container

The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:

// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());

// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}

Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability. Note that the reduction(min : ...) clause for C/C++ requires OpenMP 3.1 or later.

Avoiding copies

However, in a situation different from yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just put the corresponding pointers into a random-access container:

// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
    neListVector.push_back(&entry);
}

// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}

This is slightly more complex than the first solution, still has the same good performance on small elements, and better performance on larger elements.

Tasks and Dynamic Scheduling

Since someone else brought up OpenMP tasks in his answer, I want to comment on that too. Tasks are a very powerful construct, but they have a huge overhead (one that even increases with the number of threads) and in this case they just make things more complex.

For the min reduction alone, the use of tasks is never justified, because creating a task in the main thread costs much more than just doing the std::min itself!

For the more complex operation mesh->element_quality, you might think that the dynamic nature of tasks can help with load-balancing problems, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case there is a simpler solution: just use dynamic scheduling by adding the schedule(dynamic) clause to the parallel for line in one of the solutions above. It achieves the same behaviour with far less overhead (see the sketch below).
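
As a sketch, here is the first solution again, with the only change being the added schedule(dynamic) clause:

// Index-based loop from the first solution, now with dynamic scheduling
#pragma omp parallel for schedule(dynamic) reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}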

mastov