12

I am experimenting with OpenMP. I wrote some code to check its performance. On a single 4-core Intel CPU running Kubuntu 11.04, the following program compiled with OpenMP is around 20 times slower than the same program compiled without OpenMP. Why?

I compiled it with `g++ -g -O2 -funroll-loops -fomit-frame-pointer -march=native -fopenmp`

#include <math.h>
#include <iostream>

using namespace std;

int main ()
{
  long double i=0;
  long double k=0.7;

  #pragma omp parallel for reduction(+:i)
  for(int t=1; t<300000000; t++){       
    for(int n=1; n<16; n++){
      i=i+pow(k,n);
    }
  }

  cout << i<<"\t";
  return 0;
}
Duncan
  • **I've never used OpenMP** but it seems to me that the overhead of creating multiple threads and synchronizing the shared data access across those threads outweighs (a lot) the gain of distributing the processing across 4 different cores. – João Portela Jun 28 '11 at 13:20
  • But 20 times seems a bit too extreme. – Christian Rau Jun 28 '11 at 13:49
  • If you are trying to check OpenMP performance then using better-designed parallelizable code would be a good idea. – Steve Townsend Jun 28 '11 at 14:03
  • I just took your program and ran it on a Ubuntu 11.04 system with 2 processors. I did a simple compile (g++ 4.5.2 with no options) and a compile with OpenMP (g++ -fopenmp) and ran them. The serial program had an elapsed time of 6:45.41 and the OpenMP program (running on 2 processors) took 3:36.61 (using time to measure). Considering your program, this is what I would expect. I will try your options and see what happens. – ejd Jun 28 '11 at 14:08
  • Agree with ejd. I am seeing an approx 4 times speedup with OpenMP (using the options mentioned in the question) with gcc 4.8 on a 4 core machine. There is very little overhead. – Sameer Sep 11 '14 at 18:28

3 Answers

16

The problem is that the variable k is considered to be a shared variable, so it has to be synced between the threads. A possible solution to avoid this is:

#include <math.h>
#include <iostream>

using namespace std;

int main ()
{
  long double i=0;

#pragma omp parallel for reduction(+:i)
  for(int t=1; t<30000000; t++){       
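    // k is declared inside the parallel loop, so each thread works on its own copy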
    long double k=0.7;
    for(int n=1; n<16; n++){
      i=i+pow(k,n);
    }
  }

  cout << i<<"\t";
  return 0;
}

Following Martin Beckett's hint in the comments below, instead of declaring k inside the loop, you can also declare it const and outside the loop.
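
For example, a sketch of that variant (the same program as above, with k hoisted out of the loop as a const):

#include <math.h>
#include <iostream>

using namespace std;

int main ()
{
  long double i=0;
  const long double k=0.7;  // const, so the compiler knows it is never written

  #pragma omp parallel for reduction(+:i)
  for(int t=1; t<30000000; t++){
    for(int n=1; n<16; n++){
      i=i+pow(k,n);
    }
  }

  cout << i<<"\t";
  return 0;
}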

Otherwise, ejd is correct: the problem here does not seem to be bad parallelization, but rather bad optimization when the code is parallelized. Remember that gcc's OpenMP implementation is still pretty young and far from optimal.

olenz
  • That, and I would not be surprised if the compiler entirely optimized out the complete inner loop with all calls to `pow()` in the non-OMP case, since it can prove that k is constant and a loop with 16 iterations is within the default unroll depth. Recent versions of gcc have no trouble at all evaluating long calculations on `double`s at compile time without any precision loss. – Damon Jun 28 '11 at 14:14
  • Thanks olenz. Moving the declaration of k inside the loop solved the speed problem. However, it re-declares k 30000000 times inside the loop. I tried a different solution: keeping the declaration of k before the loop (like the original code) and changing the OpenMP directive to "#pragma omp parallel for firstprivate(k) reduction(+:i)", so k is no longer shared. However, it did not work. The program is still 20 times slower even though k is firstprivate. Why? – Duncan Jun 28 '11 at 14:40
  • Making "k" private makes no difference in my runs (which makes sense because it is never changed). Looking at the code generated, the serial case is optimized quite differently than the OpenMP version. That is what is giving the big performance difference. The OpenMP version is still doing all the calculations at runtime while the serial version is doing a lot of the work at compile time. More work has been done to optimize serial code than parallel code (so there are cases where serial runs faster than parallel - even though there is no reason the parallel code can't be optimized better). – ejd Jun 28 '11 at 14:41
  • It looks like the -funroll-loops optimisation option does not work with OpenMP when k is declared before the loop, even when k is private. Maybe it is a limitation of g++? I wonder if Intel's icc compiler can optimise it. – Duncan Jun 28 '11 at 15:14
  • It's also worth declaring everything const where possible - then the OMP synchroniser knows it doesn't have to do anything. See also 32 OMP traps for c++ programmers (http://www.viva64.com/en/a/0054/) – Martin Beckett Jun 28 '11 at 15:27
  • Try using -O3 -ffast-math – Demi Sep 12 '13 at 13:19
3

Fastest code:

for (int i = 0; i < 100000000; i ++) {;}

Slightly slower code:

#pragma omp parallel for num_threads(1)
for (int i = 0; i < 100000000; i ++) {;}

2-3 times slower code:

#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}

It does not matter what is in between { and }: a simple ; or a more complex computation gives the same results. I compiled under Ubuntu 13.10 64-bit, using both gcc and g++, trying different parameters (-ansi -pedantic-errors -Wall -Wextra -O3), and running on an Intel quad-core 3.5GHz CPU.

I guess thread management overhead is at fault? It doesn't seem smart for OMP to create a thread every time you need one and destroy it afterwards. I thought there would be four (or eight) threads, either running whenever needed or sleeping.
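
One way to check that guess is to time how long an empty parallel region takes to fork and join (a rough sketch using omp_get_wtime; compile with -fopenmp). Most OpenMP runtimes keep the worker threads alive between regions, so this mostly measures the wake-up and synchronization cost rather than actual thread creation:

#include <omp.h>
#include <iostream>

int main()
{
  const int reps = 10000;

  double t0 = omp_get_wtime();
  for (int r = 0; r < reps; r++) {
    // Each iteration forks and joins one (empty) parallel region.
    #pragma omp parallel
    { }
  }
  double dt = omp_get_wtime() - t0;

  std::cout << "average per-region overhead: "
            << dt / reps * 1e6 << " microseconds\n";
  return 0;
}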

George
0

I am observing similar behavior on GCC. However, I am wondering if in my case it is somehow related to a template or inline function. Is your code also within a template or inline function? Please look here.

However, for very short for loops you may observe some small overhead related to thread switching, as in your case:

#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}

If your loop executes for a seriously long time, a few ms or even seconds, you should observe a performance boost when using OpenMP, but only when you have more than one CPU core. The more cores you have, the higher the performance you reach with OpenMP.
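
For illustration, a sketch with a hypothetical workload (not from the question) where each iteration does enough work that the OpenMP overhead becomes negligible and the loop should scale with the number of cores (compile with -fopenmp):

#include <math.h>
#include <omp.h>
#include <iostream>

int main()
{
  const int N = 1000000;
  double sum = 0.0;

  double t0 = omp_get_wtime();
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++) {
    // Enough work per iteration that the thread-management
    // overhead is negligible compared to the computation.
    double x = 0.0;
    for (int j = 1; j < 200; j++)
      x += sin(i * 0.001) / j;
    sum += x;
  }

  std::cout << "sum = " << sum
            << ", time = " << (omp_get_wtime() - t0) << " s\n";
  return 0;
}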

no one special