
I'm starting to work with OpenMP and I'm following these tutorials:

OpenMP Tutorials

I'm coding exactly what appears in the video, but instead of better performance with more threads I get worse performance. I don't understand why.

Here's my code:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

static long num_steps = 100000000;
double step;

#define NUM_THREADS 2

int main()
{
    clock_t t;
    t = clock();
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;

    t = clock() - t;
    cout << "time: " << t << " miliseconds" << endl;

}

As you can see, it's exactly the same as in the video; I only added code to measure the elapsed time.

In the tutorial, the more threads are used, the better the performance.

In my case, that doesn't happen. Here are the timings I got:

1 thread:   433590 milliseconds
2 threads: 1705704 milliseconds
3 threads: 2689001 milliseconds
4 threads: 4221881 milliseconds

Why do I get this behavior?


-- EDIT --

gcc version: gcc 5.5.0

result of lscpu:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2594.436
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7

-- EDIT --

I've tried using omp_get_wtime() instead, like this:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

static long num_steps = 100000000;
double step;

#define NUM_THREADS 8

int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    double start_time = omp_get_wtime();

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if(id == 0) nthreads = nthrds;
        for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
    double time = omp_get_wtime() - start_time;

    cout << "time: " << time << " seconds" << endl;

}

The behavior is different, although I have some questions.

Now, if I increase the number of threads one by one (1 thread, 2 threads, 3, 4, ...), the results are basically the same as before: the performance gets worse. However, if I increase to 64 threads or 128 threads, I do get better performance; the timing decreases from 0.44 [s] (for 1 thread) to 0.13 [s] (for 128 threads).

My question is: why don't I get the same behaviour as in the tutorial?

2 threads get better performance than 1,
3 threads get better performance than 2, etc.

Why do I only get better performance with a much larger number of threads?

César Pereira
  • Your computer specification information might be useful here. – Sam Orozco May 08 '18 at 18:18
  • As you are totalling up the time spent by all threads, you would expect the time to increase, particularly as you didn't specify pragma omp for to split the loop among threads rather than making each thread repeat the entire job. Unless you distribute threads 1 per core (you didn't tell anything about your platform), you can't expect much benefit even in terms of elapsed time. – tim18 May 08 '18 at 18:58
  • Don't use `clock` for timing OpenMP programs. This is an issue covered in multiple questions here on SO; wait a while and I'll find one. – High Performance Mark May 08 '18 at 19:19
  • @SamOrozco I added my computer information. – César Pereira May 08 '18 at 23:35
  • Please check my edit to the post. I now used omp_get_wtime() but I have some questions about the results. – César Pereira May 09 '18 at 14:34
  • Although you are using the same source, there could be different reasons why your code is not performing exactly as you would expect. That's part of learning OpenMP: learn how to tune it. Several things you could do: compile with and without optimizations; run the binary several times and take the average; use OpenMP affinity. You should see some changes in the results. – jandres742 May 12 '18 at 00:21
  • As you have just 4 cores, you still have the question of what advantage you expect from running multiple copies of the same calculation rather than dividing it into chunks evenly distributed across 4 cores. Surely you could find a more normal implementation of the same toy program if you aren't willing to think about it yourself. Assuming you aren't using simd vector reduction in the inner loop to optimize single-thread performance, it should be easy to get parallel speedup, not that it would be meaningful. – tim18 May 16 '18 at 11:47

2 Answers


instead of better performance with more threads I get worse ... I don't understand why.

Well, let's make the testing a bit more systematic and repeatable, to see whether that is really the case:

// time: 1535120 milliseconds    1 thread
// time:  200679 milliseconds    1 thread  -O2  
// time:  191205 milliseconds    1 thread  -O3
// time:  184502 milliseconds    2 threads -O3
// time:  189947 milliseconds    3 threads -O3 
// time:  202277 milliseconds    4 threads -O3 
// time:  182628 milliseconds    5 threads -O3
// time:  192032 milliseconds    6 threads -O3
// time:  185771 milliseconds    7 threads -O3
// time:  187606 milliseconds   16 threads -O3
// time:  187231 milliseconds   32 threads -O3
// time:  186131 milliseconds   64 threads -O3

ref.: a few sample runs on a TiO.RUN platform fast mock-up ... where the limited resources impose a certain glass ceiling to hit ...

These runs show the effects of the { -O2 | -O3 } compilation-mode optimisations much more than the principal degradation for a growing number of threads proposed above.

Next comes the "background" noise from the non-managed code-execution ecosystem, where the O/S will easily skew any simplistic performance benchmarking.
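
As a side note ( not from the original answer ), one common way to tame that noise is to repeat the measured region several times with omp_get_wtime() and keep the best ( minimum ) time. A minimal sketch, where the repeat count and the dummy workload are arbitrary assumptions:

#include <omp.h>
#include <cstdio>
#include <algorithm>

int main()
{
    const int REPEATS = 10;                    // arbitrary choice, more repeats filter more noise
    double    best    = 1e30;

    for ( int r = 0; r < REPEATS; ++r )
    {
        double t0 = omp_get_wtime();           // wall-clock time, not per-thread CPU time

        volatile double sink = 0.0;            // stand-in for the real measured work
        for ( long i = 0; i < 10000000; ++i )
            sink += 1.0 / ( 1.0 + (double) i );

        best = std::min( best, omp_get_wtime() - t0 );
    }
    printf( "best time: %f seconds\n", best );
}

Taking the minimum rather than the average is a matter of taste; the point is simply that a single run is rarely representative.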


If indeed interested in further details, feel free to read about the law of diminishing returns ( about real-world compositions of the [SERIAL], resp. [PARALLEL] parts of the process-scheduling ), where Dr. Gene AMDAHL initiated the principal rules of why more threads do not get way better performance ( and where a bit more contemporary re-formulation of this law explains why more threads may even get a negative improvement ( more expensive add-on overheads ) compared to a right-tuned peak performance ).
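
To get a feel for those rules, here is a tiny illustrative sketch ( again, not part of the original answer ): it evaluates the classic Amdahl speedup 1 / ( ( 1 - p ) + p / N ), extended with an assumed per-thread overhead term, which is exactly what can turn extra threads into a slowdown ( the values of p and o below are made-up assumptions ):

#include <cstdio>

// Classic Amdahl speedup for a parallel fraction p on N threads,
// extended with an assumed per-thread add-on overhead o
// ( both expressed as fractions of the serial runtime ).
static double speedup( double p, int N, double o )
{
    return 1.0 / ( ( 1.0 - p ) + p / N + o * N );
}

int main()
{
    const double p = 0.95;   // assumed parallel fraction of the work
    const double o = 0.01;   // assumed overhead added per extra thread

    for ( int N = 1; N <= 64; N *= 2 )
        printf( "N = %2d   speedup ~ %.2f\n", N, speedup( p, N, o ) );
}

With these made-up numbers the speedup peaks somewhere around 8 threads and then falls again - the qualitative shape the paragraph above refers to.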


#include <time.h>
#include <omp.h>

#include <stdio.h>
#include <stdlib.h>

using namespace std;

static long   num_steps = 100000000;
       double step;

#define NUM_THREADS 7

int main()
{
    clock_t t;
    t = clock();

    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0 / ( double )num_steps;

    omp_set_num_threads( NUM_THREADS );

 // struct timespec                  start;
 // t = clock(); // _________________________________________ BEST START HERE
 // clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
    #pragma omp parallel
    {
        int    i,
               nthrds = omp_get_num_threads(),
               id     = omp_get_thread_num();
        double x;

        if ( id == 0 ) nthreads = nthrds;

        for ( i =  id, sum[id] = 0.0;
              i <  num_steps;
              i += nthrds
              )
        {
            x = ( i + 0.5 ) * step;
            sum[id] += 4.0 / ( 1.0 + x * x );
        }
    }

 // t = clock() - t; // _____________________________________ BEST STOP HERE
 // clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
    for ( i =  0, pi = 0.0;
          i <  nthreads;
          i++
          ) pi += sum[i] * step;

    t = clock() - t;
 //                                                  // time: 1535120 milliseconds    1 thread
 //                                                  // time:  200679 milliseconds    1 thread  -O2  
 //                                                  // time:  191205 milliseconds    1 thread  -O3
    printf( "time: %d milliseconds %d threads\n",    // time:  184502 milliseconds    2 threads -O3
             t,                                      // time:  189947 milliseconds    3 threads -O3 
             NUM_THREADS                             // time:  202277 milliseconds    4 threads -O3 
             );                                      // time:  182628 milliseconds    5 threads -O3
}                                                    // time:  192032 milliseconds    6 threads -O3
                                                     // time:  185771 milliseconds    7 threads -O3
user3666197

The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
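
For illustration only, a minimal sketch of that padding idea applied to this program ( the PAD of 8 doubles assumes a 64-byte cache line, which is common but not guaranteed on every CPU ):

#include <cstdio>
#include <omp.h>

#define NUM_THREADS 2
#define PAD 8                      // assumed: 64-byte cache line = 8 doubles

static long num_steps = 100000000;

int main()
{
    double step = 1.0 / (double) num_steps;
    double sum[NUM_THREADS][PAD];  // each partial sum now sits on its own cache line
    int    nthreads;

    omp_set_num_threads( NUM_THREADS );
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        if ( id == 0 ) nthreads = nthrds;

        sum[id][0] = 0.0;
        for ( long i = id; i < num_steps; i += nthrds )
        {
            double x = ( i + 0.5 ) * step;
            sum[id][0] += 4.0 / ( 1.0 + x * x );  // only element [id][0] is used, the rest is padding
        }
    }

    double pi = 0.0;
    for ( int i = 0; i < nthreads; i++ ) pi += sum[i][0] * step;
    printf( "pi ~ %.9f\n", pi );
}

Each thread now updates a different cache line, so the cores stop invalidating each other's caches on every addition.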

That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:

double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
    double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}

This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient way.

Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom-up. I'm not sure that is a good educational approach. In any case, you should probably watch all of the videos to know how best to use OpenMP in practice.

Why do I only get better performance with much bigger amount of threads?

This is a bit counterintuitive: you very rarely get better performance from using more OpenMP threads than hardware threads - unless this indirectly fixes another issue. In your case, the large number of threads means that the sum array is spread out over a larger region of memory and false sharing is less likely.

Zulan
  • In case your compiler doesn't perform scalar replacement optimization in the inner loop, the recommendation above takes care that no false sharing occurs in the inner loop. Your CPU ought to be capable of more speedup from a single-thread simd reduction than from multiple threads. Achieving both together, in case the loop is long enough to benefit, probably requires explicit 2-level looping. – tim18 May 16 '18 at 13:37
  • @tim18 Looks [reasonably SIMD](https://godbolt.org/g/Ek4pTn) to me with and without `-fopenmp`. Why wouldn't it? I do not recommend doing any manual loop optimization shenanigans unless there's hard evidence that it is beneficial and relevant in the particular scenario. – Zulan May 16 '18 at 13:46