Speaking about performance?
The code can run way faster - ~ 4x (!) for 1E6 loops.
Whether or not one uses the OpenMP tools, let's start with these. The OpenMP thread-management ( thread instantiation, task-distribution, results-collection, including a smart reduction( +:sum2 ) ) **all comes at some add-on cost** - see the amounts ( proportions ) of the assembly instructions spent on it.

Given that your #pragma-decorated code has paid all those add-on costs ( which it did, as instructed ), you gain almost nothing in return for the burnt expenses - just a reduction sum of 1E6 doubles. 1E6 loops are almost as tiny as syntax-sugar when compared to the add-on-cost-free, pure-[SERIAL] code-execution, which sums the same in a snap - in ~ 18 [ms] if smart, in less than ~ 70 [ms] if not - because it does not burn any add-on expenses on thread-management and task-distribution / result-collection overheads ( which here cost ~ 400 [ms] for a 2-core sandboxed demo test ):
0.01000
1000000
the true value is 0.500000 the number of threads we use is 2

OpenMP as-is    Time: 0.467055
SERIAL as-is    Time: 0.069820  <~~  ~ 70 [ms] @ 1E6 loops
OpenMP Re-DEF'd Time: 0.328225  | !!
SERIAL Re-DEF'd Time: 0.017899  <~~  ~ 6x FASTER, ~ 18 [ms] @ 1E6 loops
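( For reference, a minimal sketch of how such wall-clock figures can be taken - bracketing the loop with omp_get_wtime(). This is only an illustration: the payload expression below is a same-shape stand-in, reconstructed from the constants used further down, not a verbatim copy of the tested code, and lenmuta = 0.01, N = 1E6 are taken from the printout above; compile with -fopenmp: )

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main( void )
{
    const int    N       = 1000000;   /* 1E6 loops, as printed above              */
    const double lenmuta = 0.01;      /* as printed above                         */
          double sum2    = 0.;

    double t0 = omp_get_wtime();      /* wall-clock START                         */

    #pragma omp parallel for reduction( +:sum2 )
    for ( int i = 0; i < N; i++ )     /* same-shape stand-in payload ( assumed )  */
          sum2 += cos( -log( (double) rand() / RAND_MAX ) / lenmuta );

    double t1 = omp_get_wtime();      /* wall-clock STOP                          */

    printf( "OpenMP as-is Time: %f\n", t1 - t0 );
    return 0;
}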
Erratum : mea culpa - the code avoided one fDIV for the bottom case ( re-tests show ~ +10% speedup - see the code ).
Testing as low a number of loops as 1E6 ( on a-few-GHz CPU cores ... ) produces more noise than hard facts. Anyway, we can get faster with either of the strategies:
OpenMP as-is    Time: 0.375825  the estimate value is 0.5000982178
SERIAL as-is    Time: 0.062920  the estimate value is 0.5004906150
                                    |||
                         ~ 300 [us] FASTER --v
OpenMP Re-DEF'd Time: 0.062613  the estimate value is 0.4992401088
                           ~ 2 [ms] SLOWER --^
                                    ||||
SERIAL Re-DEF'd Time: 0.060253  the estimate value is 0.4995912559
It is fair to note that for longer looping the loop-increment overheads will make up more of the overall computing time, even with -O3, so the re-factored code will still exhibit the fastest results of all, yet the speedup factor will shrink to ~ 1.16x for 25E6 loops.
The core flaw : an awfully bad imbalance of costs : effects hurts the efficiency of any computation.
There is actually almost nothing to compute ( a few fADD-s, an fMUL, an fNEG, some fDIV-s, an fLOG ) inside the most expensive syntax-constructor ( not mentioning the rand() call ), which could never justify the costs spent on building the OpenMP code-execution eco-system: at ~ 70 [ns] per cos( log( (double) rand() ) ) call ( see the table below ), the whole 1E6-loop payload is worth only ~ 70 [ms], against ~ 400 [ms] burnt here on the OpenMP setup and result-collection ( yet we will show it could be reduced even ~ 6x for FASTER runs ).
Why ever re-calculate a constant value - the less so do it a MILLION+ TIMES?
So, let's weed out the things that ought never go into any performance-motivated section :
double C_LogRAND_MAX = log( (double) RAND_MAX );        // loop-invariant, computed just once
double C_1divLENMUTA =                 -1 / lenmuta;     // loop-invariant, no fDIV left inside the loop
double C_2sub        = C_LogRAND_MAX * C_1divLENMUTA;    // loop-invariant, pre-multiplied just once
and :
#pragma omp parallel for private( i, y ) reduction( +:sum2 )
for ( i = 0; i < N; i++ )
{
      sum2 += cos( C_1divLENMUTA              // fMUL
                   * log( (double) rand() )   //      + costs of keeping the rand() seed-state management
                   - C_2sub                   // fSUB
                   );
}
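( The hoisting is pure algebra. Assuming the argument of the cos() originally had the shape -log( (double) rand() / RAND_MAX ) / lenmuta - an assumption reconstructed from the constants above, not quoted from the question - the loop-invariant parts factor out as: )

$$
-\frac{\ln\!\big(\mathrm{rand()}/\mathrm{RAND\_MAX}\big)}{\mathrm{lenmuta}}
\;=\;
\underbrace{\left(\frac{-1}{\mathrm{lenmuta}}\right)}_{\texttt{C\_1divLENMUTA}}\cdot\ln\!\big(\mathrm{rand()}\big)
\;-\;
\underbrace{\left(\frac{-1}{\mathrm{lenmuta}}\right)\cdot\ln\!\big(\mathrm{RAND\_MAX}\big)}_{\texttt{C\_2sub}}
$$

so only one fMUL, one fSUB, one log() and one cos() remain inside the loop, while the fDIV and the log( RAND_MAX ) are paid just once, outside of it.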
Last but not least, parallel-sourcing of random sequences deserves another, closer look, as these tools try to maintain an internal state, which can make trouble "across" the threads. The good news is that Stack Overflow already serves a lot of material on solving this performance-hitting subject.
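( A minimal sketch of one common way out - an illustration, not the only option: give each thread its own private seed and draw from the POSIX re-entrant rand_r(), so no hidden, shared RNG state gets fought over "across" the threads. The function name estimate() and the seeding recipe are just for this sketch: )

#include <math.h>
#include <stdlib.h>
#include <omp.h>

double estimate( int N, double C_1divLENMUTA, double C_2sub )
{
    double sum2 = 0.;

    #pragma omp parallel reduction( +:sum2 )
    {
        /* one independent seed per thread: no shared, hidden RNG state */
        unsigned int seed = 123456789u + 1000003u * (unsigned int) omp_get_thread_num();

        #pragma omp for
        for ( int i = 0; i < N; i++ )
              sum2 += cos( C_1divLENMUTA
                           * log( (double) rand_r( &seed ) )   /* re-entrant, thread-local state */
                           - C_2sub
                           );
    }
    return sum2;                                                /* the reduced sum */
}

Typical libc implementations guard rand()'s hidden state with a lock or otherwise share it among threads, which serializes the very work we tried to parallelise - per-thread state side-steps exactly that.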
w/o -O3: =:_____________________________________________________________________:[ns]
SERIAL NOP Time: 3.867352 DIV( 2000000000 ) ~ 0.000000002 ~ 2 [ns]:||:for(){...}loop-overhead :
SERIAL RAND() Time: 10.845900 DIV( 1000000000 ) ~ 0.000000011 ~ +9 [ns]: |||||||||:rand() :
SERIAL (double) RAND() Time: 10.923597 DIV( 1000000000 ) ~ 0.000000011 ~ +0 [ns]: :(double) :
SERIAL LOG( (double) RAND() ) Time: 37.156017 DIV( 1000000000 ) ~ 0.000000037 ~ +27 [ns]: |||||||||||||||||||||||||||:log() :
SERIAL COS( LOG( (double) RAND() ) ) Time: 54.472115 DIV( 800000000 ) ~ 0.000000068 ~ +31 [ns]: |||||||||||||||||||||||||||||||:cos()
SERIAL COS( LOG( (double) RAND() ) ) Time: 55.320798 DIV( 800000000 ) ~ 0.000000069 : w/o -O3 :
with -O3: =:____________:
SERIAL COS( LOG( (double) RAND() ) ) Time: 9.305908 DIV( 800000000 ) ~ 0.000000012 ~ 12 [ns]:|||||||||||| with -O3 :
SERIAL COS( LOG( (double) RAND() ) ) Time: 2.135143 DIV( 200000000 ) ~ 0.000000011 : :
SERIAL LOG( (double) RAND() ) Time: 2.082406 DIV( 200000000 ) ~ 0.000000010 : :
SERIAL (double) RAND() Time: 2.244600 DIV( 200000000 ) ~ 0.000000011
SERIAL RAND() Time: 2.101538 DIV( 200000000 ) ~ 0.000000011
SERIAL NOP Time: 0.000000 DIV( 200000000 ) ~ 0.000000000
OpenMP COS( LOG( (double) RAND() ) ) Time: 33.336248 DIV( 100000000 ) ~ 0.000000333 #pragma omp parallel num_threads( 2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.388479 DIV( 1000000 ) ~ 0.000000388 #pragma omp parallel num_threads( 2 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.636114 DIV( 100000000 ) ~ 0.000000376 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 38.876272 DIV( 100000000 ) ~ 0.000000389 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 44.226391 DIV( 100000000 ) ~ 0.000000442 #pragma omp parallel num_threads( 2 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.333573 DIV( 1000000 ) ~ 0.000000334 #pragma omp parallel num_threads( 4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.624111 DIV( 100000000 ) ~ 0.000000356 #pragma omp parallel num_threads( 4 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.820558 DIV( 100000000 ) ~ 0.000000378 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 38.625498 DIV( 100000000 ) ~ 0.000000386 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 39.782386 DIV( 100000000 ) ~ 0.000000398 #pragma omp parallel num_threads( 4 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.317120 DIV( 1000000 ) ~ 0.000000317 #pragma omp parallel num_threads( 8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 34.692555 DIV( 100000000 ) ~ 0.000000347 #pragma omp parallel num_threads( 8 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.360407 DIV( 1000000 ) ~ 0.000000360 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 3.737517 DIV( 10000000 ) ~ 0.000000374 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.380087 DIV( 1000000 ) ~ 0.000000380 #pragma omp parallel num_threads( 8 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.354283 DIV( 1000000 ) ~ 0.000000354 #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.984292 DIV( 100000000 ) ~ 0.000000360 #pragma omp parallel num_threads( 16 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 3.654442 DIV( 10000000 ) ~ 0.000000365 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.233374 DIV( 100000000 ) ~ 0.000000372 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 4.112637 DIV( 10000000 ) ~ 0.000000411 #pragma omp parallel num_threads( 16 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 37.813872 DIV( 100000000 ) ~ 0.000000378 #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 0.412896 DIV( 1000000 ) ~ 0.000000413 #pragma omp parallel num_threads( 32 ) w/o
OpenMP COS( LOG( (double) RAND() ) ) Time: 34.098855 DIV( 100000000 ) ~ 0.000000341 #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 35.372660 DIV( 100000000 ) ~ 0.000000354 #pragma omp parallel num_threads( 32 ) with -O3
OpenMP COS( LOG( (double) RAND() ) ) Time: 39.256430 DIV( 100000000 ) ~ 0.000000393 #pragma omp parallel num_threads( 32 ) with -O3
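( For completeness, a minimal sketch of how such per-call costs can be decomposed - each DIV( n ) figure above is simply the measured Time divided by the loop count. The loop counts below mirror those figures; compile once without and once with -O3, plus -fopenmp for the omp_get_wtime() clock: )

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* time PAYLOAD over LOOPS iterations and report the approximate per-call cost */
#define BENCH( LABEL, LOOPS, PAYLOAD )                                          \
        do {    double t0 = omp_get_wtime();                                    \
                for ( long long i = 0; i < (LOOPS); i++ ) { PAYLOAD; }          \
                double t1 = omp_get_wtime();                                    \
                printf( "%-38s Time: %10.6f DIV( %lld ) ~ %.9f [s]\n",          \
                        LABEL, t1 - t0, (long long)(LOOPS),                     \
                        ( t1 - t0 ) / (double)(LOOPS) );                        \
        } while ( 0 )

int main( void )
{
    volatile double sink = 0.;     /* volatile: keeps the optimiser from dropping the results */

    BENCH( "SERIAL NOP",                           2000000000LL, (void) 0                              );
    BENCH( "SERIAL RAND()",                        1000000000LL, sink += rand()                        );
    BENCH( "SERIAL (double) RAND()",               1000000000LL, sink += (double) rand()               );
    BENCH( "SERIAL LOG( (double) RAND() )",        1000000000LL, sink += log( (double) rand() )        );
    BENCH( "SERIAL COS( LOG( (double) RAND() ) )",  800000000LL, sink += cos( log( (double) rand() ) ) );

    return 0;
}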
/////////////////////////////////////////////////////////////////////////////////////////////
//
// -O3
// warning: iteration 2147483647 invokes undefined behavior [-Waggressive-loop-optimizations]
// for( i = 0; i < N; i++ )
// ^~~
********************************************************************************************/
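( One more word on that warning: with a plain int loop counter, i++ beyond INT_MAX ( 2147483647 ) is a signed-integer overflow, i.e. undefined behaviour, which -O3 is free to exploit - hence the -Waggressive-loop-optimizations message. A minimal sketch of the usual cure, assuming the counter was declared int in the tested code - the function name long_checksum() is just for this sketch: widen the counter; OpenMP 3.0+ accepts loop variables wider than int: )

#include <omp.h>

double long_checksum( void )
{
    double sum2 = 0.;

    /* long long counter: no signed-int overflow UB past the 2147483647-th iteration */
    #pragma omp parallel for reduction( +:sum2 )
    for ( long long i = 0; i < 2500000000LL; i++ )
          sum2 += 1.;

    return sum2;
}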