
My OpenMP code using "reduction" doesn't return the same results from run to run.

Case 1: using "reduction"

sum = 0;
omp_set_num_threads(4);
#pragma omp parallel for reduction(+:sum)
for(ii = 0; ii < 100; ii++)
   sum = sum + func(ii);

where func(ii) has side effects. In fact, func(ii) calls another function, calcul(), which can lead to a race condition during parallel execution. I think the calcul() function may be the cause of this problem. However, when I use "critical", the results are always the same, but this solution is not good for performance.

Case 2: using "critical"

sum = 0;
#pragma omp parallel for
for(ii = 0; ii < 100; ii++)
{
   #pragma omp critical
   sum = sum + func(ii);
}

with the func(ii) function

int func(int val)
{
   read_file(val);
   calcul(); /* calculate something from read_file(val) */
   return val_fin;
}
}

Could you please help me resolve this?

Thanks a lot!

hamalo
  • It looks like your issue is not with the `reduction` part but in a race condition within the `func(ii)` call. Since we can't see the code for `func` or `calcul`, it's hard to say anything more. – Dan R Jun 30 '16 at 04:40
  • Hi Dan, my func(ii) is too complicated to post in full, so I can only present a sketch of it: func(ii) { ...read(ii); ... calcul(); ... return value; } – hamalo Jun 30 '16 at 04:59
  • You don't need to post the entire function, just enough to *recreate your issue* and make your question clear. See [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Dan R Jun 30 '16 at 05:02
  • Hi, i updated my question. Thanks for your comment – hamalo Jun 30 '16 at 05:04

2 Answers


The reason you're getting poor performance in the second case is that the entire loop body is inside a critical section, so nothing actually executes in parallel.

Since you say there are some race conditions in the calcul function, consider putting a critical section just on that line inside func. That way, the files can be read in parallel (which may be the I/O that is slowing down your execution anyway).

If the performance is still poor, you will need to look into the nested calcul function and try to identify the race conditions.

Basically, you want to push any critical sections down as far as possible or eliminate them entirely. If it comes down to very simple updates to shared variables, in some cases you can use the OpenMP atomic pragma instead, which has better performance but is much less flexible.

Dan R
  • Thanks @Dan. I focus on the same results, 'critical' can answer this question but it is not good for performance. I will try with your supporting – hamalo Jun 30 '16 at 05:35

Even if everything in the code is correct, you still might get different results from the OpenMP reduction because floating-point addition is not associative: the order in which partial results are combined can vary between runs. To reproduce the same result for a given number of threads, you need to implement the reduction yourself by storing the partial sum of each thread in a shared array. After the parallel region, the master thread adds these partial results in a fixed order. This approach requires that the threads always execute the same iterations, i.e. a static scheduling policy.

Related question: Order of execution in Reduction Operation in OpenMP

phadjido