
I am trying to add OpenMP parallelization to quite a big project, and I found that OpenMP does too much synchronization outside the parallel blocks.

This synchronization is done for all of the variables, even those not used in the parallel block, and it is done continuously, not only before entering the block.

I made an example proving this:

#include <cmath>

int main()
{
    double dummy1 = 1.234;

    int const size = 1000000;  // outer (serial) iterations
    int const size1 = 2500;    // first inner (serial) loop
    int const size2 = 500;     // second inner (parallel) loop

    for(unsigned int i=0; i<size; ++i){

        // serial work before the parallel region
        //for (unsigned int j=0; j<size1; j++){
        //  dummy1 = pow(dummy1/2 + 1, 1.5);
        //}

        #pragma omp parallel for
        for (unsigned int j=0; j<size2; j++){
            double dummy2 = 2.345;
            dummy2 = pow(dummy2/2 + 1, 1.5);
        }
    }
}

If I run this code (with the first inner loop commented out), the runtimes are 6.75s with parallelization and 30.6s without. Great.

But if I uncomment that loop and run it again, the excessive synchronization kicks in and I get 67.9s with parallelization and 73s without. If I increase size1, I even get slower results with parallelization than without it.
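For reference, a minimal sketch of how the two loops could be timed separately within a single run, using omp_get_wtime() (illustrative only, same dummy workload as above):

#include <cmath>
#include <cstdio>
#include <omp.h>

int main()
{
    double dummy1 = 1.234;
    double serialTime = 0.0, parallelTime = 0.0;

    for (unsigned int i = 0; i < 1000000; ++i) {
        double t0 = omp_get_wtime();

        // serial phase (the loop that is commented out above)
        for (unsigned int j = 0; j < 2500; j++) {
            dummy1 = pow(dummy1/2 + 1, 1.5);
        }

        double t1 = omp_get_wtime();

        #pragma omp parallel for
        for (unsigned int j = 0; j < 500; j++) {
            double dummy2 = 2.345;
            dummy2 = pow(dummy2/2 + 1, 1.5);
        }

        double t2 = omp_get_wtime();
        serialTime   += t1 - t0;
        parallelTime += t2 - t1;  // includes the fork/join overhead
    }

    std::printf("serial: %.2fs, parallel: %.2fs\n", serialTime, parallelTime);
}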

Is there a way to disable this synchronization and force it only before the second for cycle? Or any other way to improve the speed?

Note that neither the outer loop nor the first inner loop is parallelizable in the real code. The outer one is in fact an ODE solver and the first inner one updates loads of internal values.

I am using gcc (SUSE Linux) 4.8.5

Thanks for your answers.

David Sery
  • This question makes very little sense. You are saying that adding work outside OpenMP adds execution time. Is that surprising? If you run the code with your dummy loop commented in and your OpenMP loop completely commented out, you'll see that the first loop takes time! You seem to have reached a conclusion (that OpenMP is doing something it's not) based on no evidence! – Jim Cownie Aug 21 '17 at 08:35
  • Is this code correct? `dummy1` seems to be unused in the parallel loop. Also, have you timed the first and second loop independently? It may be that the first loop takes so long to process that the effect of parallelization on the second loop is not noticeable when measured together. – Tudor Aug 21 '17 at 14:41
  • @Tudor this code is not supposed to do anything useful, it is just an example. – David Sery Aug 22 '17 at 09:46
  • @JimCownie: I thought the measured numbers were clear: if we look only at the results without parallelization, the second for loop takes 30.6s and the first one 73 - 30.6 = 42.4s. As for the results with parallelization, the second loop takes 6.75s, whereas the first one takes 67.9 - 6.75 = 61.15s. Now I hope you see that the added work comes not only from the for cycle itself, but also from something extra -- something caused by OpenMP. I hope I made it clearer. – David Sery Aug 22 '17 at 09:52
  • @David Sery: It may be because you're using the same variable name for both loops. Have you tried changing `j` to something else in the first loop? – Tudor Aug 22 '17 at 10:31
  • @Tudor I have not tried that yet, but I will. I did not even consider it, because as I understand it the `j` variable should be local in this case. – David Sery Aug 22 '17 at 10:46
  • @Tudor I tried that and it did not help. I am still getting around 70s instead of the desired 50s. – David Sery Aug 23 '17 at 05:11
  • What are your compile options? What hardware are you testing this on? Your version of GCC is ancient BTW. – Z boson Aug 28 '17 at 08:23
  • I only use the -fopenmp and -o options so no optimization ruins my measurements (g++ timingDummy.cpp -o timingDummy -fopenmp). The CPU is an Intel(R) Xeon(R) CPU E5-1620, 16GB RAM. Do you need more hardware info? What exactly? – David Sery Aug 28 '17 at 08:50

1 Answer


In the end the solution to my problem was setting the number of threads equal to the number of physical processor cores. It seems hyperthreading was causing the problems. So, using (my processor has 4 physical cores)

#pragma omp parallel for num_threads(4)

I get times of 8.7s without the first for loop and 51.9s with it. There is still about 1.2s of overhead, but that is acceptable. Using the default (8 threads)

#pragma omp parallel for

the times are 6.65s and 68s. Here the overhead is about 19s.

So hyperthreading helps if no other code is present, but when there is, it might not always be a good idea to use it.
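For completeness, the thread count does not have to be hard-coded in the pragma. It can also be set at runtime with omp_set_num_threads(), or from the shell via the OMP_NUM_THREADS environment variable. A minimal sketch (dividing omp_get_num_procs() by 2 assumes 2-way hyperthreading, as on my CPU):

#include <omp.h>

int main()
{
    // omp_get_num_procs() reports logical CPUs; dividing by 2
    // assumes 2-way hyperthreading to get the physical core count.
    omp_set_num_threads(omp_get_num_procs() / 2);

    #pragma omp parallel for
    for (int j = 0; j < 500; j++){
        // ... the parallel work ...
    }
}

Or, without recompiling: OMP_NUM_THREADS=4 ./timingDummy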

David Sery