Yes, the way you explain it is how "Hristo Iliev" wanted it to work. Unfortunately, he got several things about tasks wrong: his so-called "barrier" using `taskwait` only works for the thread that created the tasks; all other threads run straight through it. So his whole reduction at the end is useless, because most threads execute it before they have actually done any work. `taskwait` was designed to help account for data dependencies (very useful in recursive functions!), which is why it works the way it does.
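As an aside, here is a minimal sketch (my own illustration, not code from the original post) of the classic recursive use case `taskwait` was designed for: the parent task spawns two child tasks and waits for exactly those children before combining their results.

```cpp
// Hypothetical illustration: taskwait waits only for the child tasks
// of the *current* task, which is exactly what divide-and-conquer needs.
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)  // first child task
    x = fib(n - 1);
    #pragma omp task shared(y)  // second child task
    y = fib(n - 2);
    #pragma omp taskwait        // wait for the two children above, nobody else
    return x + y;
}
```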
The only reason the program sometimes works, even without a functional reduction at the end, is that his first mistake is partially cancelled out by a second one: the variable `t_worst_q` inside the task references the one from the thread that generated the task and therefore effectively acts as a global variable. But then, of course, we have a race condition in the `std::min` call, which still makes the code fail occasionally.
So in effect his code is equivalent to the following:
```cpp
#pragma omp parallel
{
    // Create tasks
    #pragma omp single nowait
    {
        for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
             it != mesh->NEList[vid].end(); ++it) {
            size_t elem = *it;
            #pragma omp task shared(worst_q)
            worst_q = std::min(worst_q, mesh->element_quality(elem)); // unsynchronised update: data race
        }
    }
}
```
This, as stated before, has the problem of a race condition in the access to `worst_q` when executing the `std::min`. If the function `mesh->element_quality` is computationally expensive compared to `std::min`, then a simple solution is a critical section around the `std::min` part (after executing `mesh->element_quality` outside the critical section). If, on the other hand, `mesh->element_quality` isn't more expensive than `std::min`, the critical section will kill the parallelism. But in this case `mesh->element_quality` is so cheap that the overhead of managing the tasks will eat up any potential speed-up anyway.
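Under the assumption that `mesh->element_quality` really does dominate, a sketch of that critical-section variant could look like this (names taken from the snippet above; I am assuming `element_quality` returns a `double`, adjust the type as needed). Note the `taskwait` at the end, which makes a subsequent read of `worst_q` by the generating thread valid:

```cpp
#pragma omp parallel
{
    #pragma omp single nowait
    {
        for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
             it != mesh->NEList[vid].end(); ++it) {
            size_t elem = *it;
            #pragma omp task shared(worst_q)
            {
                // The expensive call runs concurrently, outside any lock...
                double q = mesh->element_quality(elem);
                // ...and only the cheap min update is serialised.
                #pragma omp critical
                worst_q = std::min(worst_q, q);
            }
        }
        // All child tasks of this task-generating region finish here, so
        // the thread that created the tasks can safely read worst_q afterwards.
        #pragma omp taskwait
    }
}
```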