0

I'm using OpenMP in Visual Studio 2010 to speed up loops.

I wrote a very simple test to see the performance increase from OpenMP. I use omp parallel for on an empty loop:

int time_before = clock();

#pragma omp parallel for
for(int i = 0; i < 4; i++){

}

int time_after = clock();

std::cout << "time elapsed: " << (time_after - time_before) << " milliseconds" << std::endl;

Without the omp pragma it consistently takes 0 milliseconds to complete (as expected), and with the pragma it usually takes 0 as well. The problem is that with the omp pragma the time spikes occasionally, anywhere from 10 to 32 milliseconds. Every time I have tried parallelizing with OpenMP I get these random spikes, which is why I tried this very basic test. Are the spikes an inherent part of OpenMP, or can they be avoided?

The parallel for gives me great speed boosts on some loops, but these random spikes are too big for me to be able to use it.

user3124047
  • 153
  • 11
  • You wouldn't happen to be using Windows? I had spikes of up to 150 ms on Windows with Visual Studio's OpenMP implementation. – Joe Jun 29 '14 at 09:31
  • 1
    There is always some overhead when using threads; make sure you have enough work for each one for threading to be profitable. – Vladimir F Героям слава Jun 29 '14 at 09:35
  • 1
    It should also be mentioned that the precision of clock is extremely low: often 10-15 ms, which is not really usable for profiling. – Joe Jun 29 '14 at 09:39
  • I do not use OpenMP for loops that I expect to finish in less than a few seconds. The overhead of creating threads is too large for small loops. – drescherjm Jun 29 '14 at 12:44

3 Answers

2

That's pretty normal behavior. Sometimes your operating system is busy and needs more time to spawn new threads.

kukis
  • 4,489
  • 6
  • 27
  • 50
2

I want to complement kukis' answer: I'd also say that the spikes are due to the additional overhead that comes with OpenMP.

Furthermore, as you are doing performance-sensitive measurements, I hope you compiled your code with optimizations turned on. In that case, the loop without OpenMP simply gets optimized out by the compiler, so there is no code at all between time_before and time_after. With OpenMP, however, at least g++ 4.8.1 (-O3) is unable to optimize the code away: the loop is still there in the assembly and contains additional statements to manage the work sharing. (I cannot try it with VS at the moment.)

So the comparison is not really fair, because the version without OpenMP gets optimized out completely.
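To make the comparison fair, you have to give the loop an observable side effect so that neither version can be optimized away. A minimal sketch (the reduction and the volatile sink are my additions, not part of the original question):

int sum = 0;

int time_before = clock();

// reduction(+: sum) gives each thread a private copy of sum and
// combines the copies at the end, so there is no data race
#pragma omp parallel for reduction(+: sum)
for(int i = 0; i < 4; i++){
    sum += i;
}

int time_after = clock();

volatile int sink = sum; // storing the result in a volatile keeps the
                         // compiler from discarding the loop

The same volatile sink also keeps the serial version of the loop alive, so both measurements then time comparable code.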

Edit: You also have to keep in mind that OpenMP doesn't re-create the threads every time. Rather, it uses a thread pool. So, if you execute an omp construct before your measured loop, the threads will already exist when it encounters the next one:

// Dummy loop: Spawn the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_before = clock();

// Do the actual measurement. OpenMP re-uses the threads.
#pragma omp parallel for
for(int i = 0; i < 4; i++){
}

int time_after = clock();

In this case, the spikes should vanish.

Sedenion
  • 5,421
  • 2
  • 14
  • 42
  • I originally tested with contents inside the loop, and the random spikes were identical. Using the Parallel Patterns Library's parallel_for, I get a speed boost without the random delays of OpenMP (however, that library is only available on Windows), so the random spikes are not an intrinsic part of threads. – user3124047 Jun 29 '14 at 07:56
  • @user3124047 The library probably created the threads before you started your measurements. See my edit. – Sedenion Jun 29 '14 at 09:27
  • 1
    It still takes time to break the work up into chunks, and hand to the threads; a sequential implementation has none of this extra work. So the running time of the body of the loop better exceed the overhead of partitioning/spinning work out to threads by a significant margin, or the parallel version *loses*. A loop with 4 iterations and an empty body is all overhead, and will always look terrible when parallelized. – Ira Baxter Jun 30 '14 at 05:05
2

If "OpenMP parallel spiking", which I would call "parallel overhead", is a concern in your loop, this infers you probably don't have enough workload to parallelize. Parallelization yields a speedup only if you have a sufficient problem size. You already showed an extreme example: no work in a parallelized loop. In such case, you will see highly fluctuating time due to parallel overhead.

The parallel overhead in OpenMP's omp parallel for includes several factors:

  • First, omp parallel for is the sum of omp parallel and omp for.
  • The overhead of spawning or awakening threads (many OpenMP implementations won't create and destroy threads on every omp parallel; they keep a thread pool instead).
  • Regarding omp for, the overhead of (a) dispatching workloads to worker threads and (b) scheduling (especially if dynamic scheduling is used).
  • The overhead of the implicit barrier at the end of the omp for work-sharing construct, which can be removed with nowait (see the sketch after this list); the enclosing parallel region always ends with its own barrier.

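As a rough illustration of the last point (n and work() are placeholders of mine, not from the question):

#pragma omp parallel
{
    #pragma omp for nowait
    for(int i = 0; i < n; ++i){
        work(i); // placeholder loop body
    }
    // with nowait, threads fall through here without synchronizing;
    // the implicit barrier at the closing brace of the parallel
    // region still applies
}
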
FYI, in order to measure OpenMP's parallel overhead, the following would be more effective:

#include <ctime> // for clock()

double measureOverhead(int tripCount) {
  static const size_t TIMES = 10000;
  int sum = 0;

  // Serial baseline.
  clock_t startTime = clock();
  for (size_t k = 0; k < TIMES; ++k) {
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime = clock() - startTime;

  // The same loop, parallelized.
  clock_t startTime2 = clock();
  for (size_t k = 0; k < TIMES; ++k) {
  #pragma omp parallel for private(sum) // we don't care about the correctness
                                        // of sum here; otherwise, use
                                        // "reduction(+: sum)"
    for (int i = 0; i < tripCount; ++i) {
      sum += i;
    }
  }
  clock_t elapsedTime2 = clock() - startTime2;

  // Average extra cost of one omp parallel for, in clock ticks.
  double parallelOverhead = double(elapsedTime2 - elapsedTime)/double(TIMES);
  return parallelOverhead;
}

Try running such a small piece of code many times, then take an average. Also, put at least a minimal workload in the loops. In the code above, parallelOverhead is an approximation of the overhead of OpenMP's omp parallel for construct.
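
For example, a small driver along these lines (the trip counts are arbitrary values of mine, and it assumes the measureOverhead function above is in scope) shows how the relative overhead shrinks as the workload grows:

#include <iostream>

int main() {
  const int tripCounts[] = { 4, 1000, 1000000 };
  for (int t = 0; t < 3; ++t) {
    std::cout << "trip count " << tripCounts[t]
              << ": overhead ~ " << measureOverhead(tripCounts[t])
              << " clock ticks" << std::endl;
  }
  return 0;
}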

minjang
  • 8,860
  • 9
  • 42
  • 61