OpenMP: Why is reduction so much faster than splitting up the task

Asked Apr 16 '20 at 14:30

Active Apr 16 '20 at 15:43

Viewed 52 times

I implemented two versions of the pi approximation. I tested them and noticed that one version is much faster, but I don't really understand why. In the first version I created an array with one element per thread (threads_number elements), and each thread updates only its own index. In the second version I just used a reduction.

first version:

#pragma omp parallel private(x) shared(sum_vector)
{
    int tid = omp_get_thread_num();
    for (int i = tid; i < num_steps; i += threads_number){
        x = (i+0.5)*step;
        sum_vector[tid] += 4.0/(1.0+x*x);
    }
}

second version:

#pragma omp parallel reduction(+:sum) private(x)
{
    int nthreads = omp_get_num_threads();
    int id = omp_get_thread_num();
    for (int i = id; i < num_steps; i += nthreads){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
}

The second version is almost twice as fast for 1 million iterations or more.

I would appreciate every answer! Thank you in advance!

edited Apr 16 '20 at 15:43

whoami1996

I don't know anything about OpenMP, but my guess would be that the bottleneck in the first version is synchronisation of access to the `shared(sum_vector)`. – Hulk Apr 16 '20 at 14:43
Please include a [mcve], pick one language, and tell us what compiler options you used. – 463035818_is_not_an_ai Apr 16 '20 at 14:43
This could be a nice and interesting question, but talking about run time without knowing the optimization level is pointless, hence I vote to close. – 463035818_is_not_an_ai Apr 16 '20 at 15:06
https://en.wikipedia.org/wiki/False_sharing causing contention for `sum_vector` even though no single element is shared. – Ben Voigt Apr 16 '20 at 15:47
This is the problem, thank you so much! @BenVoigt – whoami1996 Apr 16 '20 at 15:56
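
For anyone landing here later, the false-sharing explanation from the comments can be illustrated with a rough sketch of the first version in which each thread's partial sum is padded out to its own cache line. This is not the asker's actual program: `pi_padded`, `padded_sum`, `CACHE_LINE` (assumed to be 64 bytes, which is typical but not guaranteed) and `MAX_THREADS` are hypothetical names chosen for the example, and the serial fallbacks are only there so that it also builds without `-fopenmp`.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

#define CACHE_LINE  64  /* assumed cache-line size in bytes */
#define MAX_THREADS 64  /* hypothetical upper bound for this sketch */

/* One partial sum per thread, padded so that consecutive sums are
 * CACHE_LINE bytes apart and can never sit in the same cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_sum;

/* Midpoint-rule approximation of pi, same loop as in the question. */
double pi_padded(long num_steps)
{
    assert(num_steps > 0);

    padded_sum sums[MAX_THREADS] = {{0.0, {0}}};
    const double step = 1.0 / (double)num_steps;
    int requested = omp_get_max_threads();
    int used = 1;  /* number of threads that actually ran */

    if (requested > MAX_THREADS)
        requested = MAX_THREADS;

    #pragma omp parallel num_threads(requested)
    {
        const int nthreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        assert(nthreads <= MAX_THREADS);

        if (tid == 0)
            used = nthreads;

        /* Each thread writes only to its own padded slot. */
        for (long i = tid; i < num_steps; i += nthreads) {
            const double x = (i + 0.5) * step;
            sums[tid].value += 4.0 / (1.0 + x * x);
        }
    }

    /* Combine the per-thread partial sums serially. */
    double sum = 0.0;
    for (int t = 0; t < used; ++t)
        sum += sums[t].value;
    return step * sum;
}
```

Calling `pi_padded(1000000)` should give a value close to pi. Accumulating into a thread-local `double` and writing it to the array once after the loop avoids the shared cache line as well, and is roughly what the `reduction(+:sum)` clause arranges by giving each thread a private copy of `sum`. Whether the padded version actually closes the gap depends on the compiler, the optimization level and the hardware, so it is worth timing on your own machine.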
