
I have written this code to estimate the value of an integral.

It is a straightforward, simple for() loop, parallelised with OpenMP.

Whatever I do, I cannot get the parallel running time below the serial one.

What is the problem?

lenmuta, tol, cores and seed are 1, 0.01, number_of_threads and 0, respectively.

Here is the code:

// ================= Libraries
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>

int main( int argc, char* argv[] )
{
    float lenmuta = atof( argv[1] );
    float tol     = atof( argv[2] );
    int   cores   = atoi( argv[3] );
    int   seed    = atoi( argv[4] );

#define M  ( 1 / ( tol*tol*0.01*lenmuta*lenmuta ) )

    unsigned int N = M;

    printf( "%10.5f \n", tol );
    printf(     "%d \n", N );

    omp_set_num_threads( cores );

    double sum2 = 0.0;   // initialise: reduction( +:sum2 ) adds the threads' partial sums to this value
    int    Threadnum;
    float  rvalue;

    Threadnum = omp_get_max_threads();
    rvalue    = lenmuta / ( 1 + lenmuta * lenmuta );

    printf( "the true value is %f \n", rvalue );
    printf( "the number of threads we use is %d \n", Threadnum );

    struct random_data* state;
    double start1 = omp_get_wtime();
    int    k, i;
    double y;
    char   statebuf[32];

    state = malloc( sizeof( struct random_data ) );

    initstate_r( seed, statebuf, 32, state );
    srandom_r(   seed, state );

    // =========Parallel Region=======================

    #pragma omp parallel for private( i, y ) reduction( +:sum2 )
    for( i = 0; i < N; i++ )
    {
         y     = -( 1 / lenmuta ) * log( (double) rand() / ( (double) RAND_MAX ) );
         sum2 +=  cos( y );
    }
    sum2 = sum2 / N;

    printf( "Time: \t %f \n", omp_get_wtime() - start1 );
    printf( "the estimate value is %1.10f \n", sum2 );

    return 0;
}
1 Answer

Speaking about performance?
The code can run way faster, ~ 4x (!), for 1E6 loops.

Whether or not we use the OpenMP tools, let's start with them. The OpenMP thread management ( thread instantiation, task distribution and result collection, including the smart reduction(+:sum2) ) **all comes at some add-on cost** - see the amount ( proportion ) of assembly instructions spent on it.

[ figure: disassembly view, showing the share of instructions spent on OpenMP thread-management ]

Given that your #pragma-decorated code has paid all those add-on costs ( which it did, as instructed ), you gain almost nothing in return for the burnt expenses - a reduction sum of just 1E6 doubles. 1E6 is so tiny it is almost mere syntax-sugar when compared to an add-on-cost-free pure-[SERIAL] code execution, which sums the same in a snap, ~ 18 [ms] if smart ( and even a not-so-smart version needs only ~ 70 [ms] ), since it burns nothing on thread-management and task-distribution / result-collection overheads ( which here cost ~ 400 [ms] on a 2-core sandboxed demo test ):

   0.01000 
1000000 
the true value is 0.500000      the number of threads we use is 2 

OpenMP as-is    Time:    0.467055     
SERIAL as-is    Time:    0.069820 <~~+            70 [ms] @ 1E6 loops
OpenMP Re-DEF'd Time:    0.328225    |            !!
SERIAL Re-DEF'd Time:    0.017899 <~~+~ 6x FASTER 18 [ms] @ 1E6 loops
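
( If curious how much of that is the pure thread-management add-on cost, a minimal sketch - assuming a gcc/clang build with -fopenmp, and not the benchmark used above - can time an "empty" parallel reduction region, with zero useful work inside: )

// a minimal sketch ( an illustration, not the benchmark above ):
// time an "empty" #pragma omp parallel reduction(+:...) region a few times,
// to see the fork / join + reduction add-on cost paid even with zero work inside
#include <stdio.h>
#include <omp.h>

int main( void )
{
    double sink = 0.0;                               // reduction target; no real work feeds it

    for( int rep = 0; rep < 5; rep++ )               // the very first rep also pays the
    {                                                // one-off thread-pool start-up cost
        double t0 = omp_get_wtime();

        #pragma omp parallel reduction( +:sink )     // fork + join + reduction, nothing else
        sink += 0.0;

        printf( "rep %d : empty parallel region took %f [s]\n", rep, omp_get_wtime() - t0 );
    }
    return (int) sink;                               // keep sink observable
}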

Erratum: mea culpa - the re-factored code avoids one fDIV in the bottom case ( re-tests show ~ +10% speedup - see the code ).

Testing with as few loops as 1E6 ( on a-few-GHz CPU cores ... ) produces more noise than hard facts. Anyway, we can get faster with either of the strategies:

OpenMP as-is    Time:      0.375825     the estimate value is 0.5000982178 
SERIAL as-is    Time:      0.062920     the estimate value is 0.5004906150
                                |||
                               ~ 300 [us] FASTER --v
OpenMP Re-DEF'd Time:      0.062613     the estimate value is 0.4992401088
                               ~ 2   [ms] SLOWER --^
                               ||||
SERIAL Re-DEF'd Time:      0.060253     the estimate value is 0.4995912559 

It is fair to note that, for longer runs, the loop-increment overhead accounts for a larger share of the overall computing time, even with -O3, so the re-factored code still exhibits the fastest results throughout, yet the speedup factor shrinks to ~ 1.16x for 25E6 loops.

The core flaw:
the awfully bad costs : effects imbalance
hurts the efficiency of any computation.

There is actually almost nothing to compute inside the most expensive syntax-constructor ( a few fADDs, an fMUL, an fNEG, fDIVs and an fLOG, not counting the random-number call ), and that tiny workload can never justify the costs spent on building the OpenMP code-execution eco-system ( yet we will show it can even be reduced ~ 6x for FASTER runs ).

Why re-calculate a constant value at all, let alone do it a MILLION+ TIMES?

So, let's weed out the things that ought never go into any performance-motivated section:

double C_LogRAND_MAX = log( (double) RAND_MAX );       // a constant, computed once
double C_1divLENMUTA = -1 / lenmuta;                   // a constant, computed once
double C_2sub        = C_LogRAND_MAX * C_1divLENMUTA;  // a constant, computed once

and :

#pragma omp parallel for private( i ) reduction( +:sum2 )
for( i = 0; i < N; i++ )
{
     sum2 +=  cos( C_1divLENMUTA           // fMUL
                 * log( (double) rand() )  // + the cost of rand()'s internal seed-state management
                 - C_2sub                  // fSUB
                   );
}
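
( For completeness, here is a minimal, self-contained sketch of how the hoisted constants sit in a whole pure-[SERIAL] "Re-DEF'd" program - lenmuta, N and the seed are hard-wired purely for illustration, and the +1.0 / ( RAND_MAX + 1.0 ) rescaling is an added guard against log( 0 ), not present in the code above: )

// a minimal, self-contained sketch of the re-factored pure-[SERIAL] variant
// ( illustrative assumptions: lenmuta = 1, N = 1E6, seed = 0 )
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( void )
{
    const double       lenmuta = 1.0;
    const unsigned int N       = 1000000;

    srand( 0 );

    const double C_1divLENMUTA = -1.0 / lenmuta;                                  // hoisted once
    const double C_2sub        = C_1divLENMUTA * log( (double) RAND_MAX + 1.0 );  // hoisted once

    double sum2 = 0.0;
    for( unsigned int i = 0; i < N; i++ )
        sum2 += cos( C_1divLENMUTA * log( (double) rand() + 1.0 )  // fLOG + fMUL ( +1.0 avoids log( 0 ) )
                   - C_2sub );                                     // fSUB
    printf( "the estimate value is %1.10f\n", sum2 / N );
    return 0;
}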

Last but not least, parallel sourcing of random sequences deserves another, closer look, as these tools maintain an internal state, which causes trouble "across" the threads. The good news is that Stack Overflow already has plenty of material on this performance-hitting subject.
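
One minimal way to keep the PRNG state thread-private is sketched below - rand_r() is POSIX and a rather weak generator, used here only to illustrate the per-thread-state idea ( the seeds and constants are arbitrary choices ); a counter-based or properly leap-frogged generator is the better tool for real work:

// a minimal sketch: each thread owns its PRNG state, so no hidden shared state
// is serialised "across" the threads ( seeds below are arbitrary choices )
#define _POSIX_C_SOURCE 200112L   // for rand_r()
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main( void )
{
    const double       lenmuta = 1.0;
    const unsigned int N       = 1000000;

    const double C_1divLENMUTA = -1.0 / lenmuta;
    const double C_2sub        = C_1divLENMUTA * log( (double) RAND_MAX + 1.0 );

    double sum2 = 0.0;

    #pragma omp parallel reduction( +:sum2 )
    {
        unsigned int tseed = 1u + 977u * (unsigned int) omp_get_thread_num();    // per-thread seed

        #pragma omp for
        for( int i = 0; i < (int) N; i++ )
            sum2 += cos( C_1divLENMUTA * log( (double) rand_r( &tseed ) + 1.0 )  // thread-private state
                       - C_2sub );
    }
    printf( "the estimate value is %1.10f\n", sum2 / N );
    return 0;
}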


w/o -O3:                                                                                               =:_____________________________________________________________________:[ns]
SERIAL                     NOP       Time:     3.867352 DIV( 2000000000 ) ~   0.000000002     ~   2 [ns]:||:for(){...}loop-overhead                                           :
SERIAL                    RAND()     Time:    10.845900 DIV( 1000000000 ) ~   0.000000011     ~  +9 [ns]:  |||||||||:rand()                                                   :
SERIAL           (double) RAND()     Time:    10.923597 DIV( 1000000000 ) ~   0.000000011     ~  +0 [ns]:           :(double)                                                 :
SERIAL      LOG( (double) RAND() )   Time:    37.156017 DIV( 1000000000 ) ~   0.000000037     ~ +27 [ns]:           |||||||||||||||||||||||||||:log()                         :
SERIAL COS( LOG( (double) RAND() ) ) Time:    54.472115 DIV(  800000000 ) ~   0.000000068     ~ +31 [ns]:                                      |||||||||||||||||||||||||||||||:cos()
SERIAL COS( LOG( (double) RAND() ) ) Time:    55.320798 DIV(  800000000 ) ~   0.000000069               :                        w/o  -O3                                     :
w/-O3: :::( :::( (::::::) ::::() ) )          !!.       :::(  ::::::::: )              !!              =:____________:           :!                                           :
SERIAL COS( LOG( (double) RAND() ) ) Time:     9.305908 DIV(  800000000 ) ~   0.000000012     ~  12 [ns]:||||||||||||            with -O3                                     :
SERIAL COS( LOG( (double) RAND() ) ) Time:     2.135143 DIV(  200000000 ) ~   0.000000011               :                                                                     :                                                                       
SERIAL      LOG( (double) RAND() )   Time:     2.082406 DIV(  200000000 ) ~   0.000000010               :                                                                     :                                                                       
SERIAL           (double) RAND()     Time:     2.244600 DIV(  200000000 ) ~   0.000000011
SERIAL                    RAND()     Time:     2.101538 DIV(  200000000 ) ~   0.000000011
SERIAL                     NOP       Time:     0.000000 DIV(  200000000 ) ~   0.000000000
                                                                                       ^^
                                                                                       ||
                                                                                      !||
                                                                                      vvv
OpenMP COS( LOG( (double) RAND() ) ) Time:    33.336248 DIV(  100000000 ) ~   0.000000333  #pragma omp parallel num_threads(  2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.388479 DIV(    1000000 ) ~   0.000000388  #pragma omp parallel num_threads(  2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.636114 DIV(  100000000 ) ~   0.000000376  #pragma omp parallel num_threads(  2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    38.876272 DIV(  100000000 ) ~   0.000000389  #pragma omp parallel num_threads(  2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    44.226391 DIV(  100000000 ) ~   0.000000442  #pragma omp parallel num_threads(  2 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.333573 DIV(    1000000 ) ~   0.000000334  #pragma omp parallel num_threads(  4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.624111 DIV(  100000000 ) ~   0.000000356  #pragma omp parallel num_threads(  4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.820558 DIV(  100000000 ) ~   0.000000378  #pragma omp parallel num_threads(  4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    38.625498 DIV(  100000000 ) ~   0.000000386  #pragma omp parallel num_threads(  4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    39.782386 DIV(  100000000 ) ~   0.000000398  #pragma omp parallel num_threads(  4 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.317120 DIV(    1000000 ) ~   0.000000317  #pragma omp parallel num_threads(  8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    34.692555 DIV(  100000000 ) ~   0.000000347  #pragma omp parallel num_threads(  8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.360407 DIV(    1000000 ) ~   0.000000360  #pragma omp parallel num_threads(  8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     3.737517 DIV(   10000000 ) ~   0.000000374  #pragma omp parallel num_threads(  8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.380087 DIV(    1000000 ) ~   0.000000380  #pragma omp parallel num_threads(  8 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:     0.354283 DIV(    1000000 ) ~   0.000000354  #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.984292 DIV(  100000000 ) ~   0.000000360  #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     3.654442 DIV(   10000000 ) ~   0.000000365  #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    37.233374 DIV(  100000000 ) ~   0.000000372  #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:     4.112637 DIV(   10000000 ) ~   0.000000411  #pragma omp parallel num_threads( 16 ) with -O3

OpenMP COS( LOG( (double) RAND() ) ) Time:    37.813872 DIV(  100000000 ) ~   0.000000378  #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:     0.412896 DIV(    1000000 ) ~   0.000000413  #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time:    34.098855 DIV(  100000000 ) ~   0.000000341  #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    35.372660 DIV(  100000000 ) ~   0.000000354  #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time:    39.256430 DIV(  100000000 ) ~   0.000000393  #pragma omp parallel num_threads( 32 ) with -O3


/////////////////////////////////////////////////////////////////////////////////////////////
//
// -O3
// warning: iteration 2147483647 invokes undefined behavior [-Waggressive-loop-optimizations]
//     for( i = 0; i < N; i++ )
//     ^~~
********************************************************************************************/
  • Since you mentioned parallel random number generation... I just answered a question on exactly that https://stackoverflow.com/questions/60292324/how-to-generate-uniformly-distributed-random-numbers-between-0-and-1-in-a-c-code/60296481#60296481 – Jim Cownie Feb 20 '20 at 08:49
  • Salutes to Bristol, @JimCownie ( *a deep fan of the innovative UK Transputer generation + the transputer-hardware-supported, truly **`[PARALLEL]` occam** language here* ) - it surprised me that the proposed Intel MKL tool enjoys a **periodicity *as short as 2^128***, when there are quarter-of-a-century old & stable, computationally cheap ( read: non-sharing + non-blocking ) PRNGs with periods **way above 2^(318+)**, i.e. more than 1E+57 times longer. What is, in your experience, the core advantage of the proposed MKL that I have perhaps missed here? – user3666197 Feb 20 '20 at 11:04
  • Thanks for the thumbs up. The major benefit of the algorithms from the "Parallel Random Number Generation as Easy as 1,2,3" paper is that you can explicitly choose segments of the random number sequence which can be guaranteed to be a specific distance from each other. Thus you know *for sure* that in your parallel program, where you have many threads generating random numbers, the sequences they generate will not overlap. – Jim Cownie Feb 21 '20 at 08:45
  • *( well, unless one consumes those but 2^128 members of the state-space & must start repeating 'em in the loop )* - thanks for the remark, @JimCownie – user3666197 Feb 21 '20 at 13:45
  • True, but, even with 1Mi threads (2**20) you can still generate 2**108 numbers in each thread before that happens, which, if you are generating them at the rate of one/ns, will take 3.25e41 seconds or ~ 1e34 years, which seems reasonably safe given the expected remaining lifetime of the sun is only 5e9 years... – Jim Cownie Feb 24 '20 at 08:58
  • **This deserves a `+1`**, @JimCownie - thanks for the reminder of the limited time-span we live in. While it goes well beyond the scope of this post, the beauty of fair randomness is that it shall never ( be it within a bounded time-span or not ) come anywhere close to exhausting the principal state-space. By the very same principle we evolve & keep a by-far excess of expressivity ( richness ) in the mathematical apparatus, with the **same certainty** that there are no resources ( known so far ) that could ever materialise all of its expressible richness, which does not restrict the need for that richness :o) – user3666197 Feb 24 '20 at 16:12