Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector
) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, complexCalc(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t
that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP Tasks in his answer, I want to comment on that to. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min
reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min
itself!
For the more complex operation mesh->element_quality
you might think that the dynamic nature of Tasks can help you with load-balancing problems, in case that the execution time of mesh->element_quality
varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: Simply use dynamic scheduling by adding the schedule(dynamic)
directive to your parallel for
line in one of my previous solutions. It achieves the same behaviour which far less overhead.