
I'm trying to benchmark the computation of f(x) while varying the number of threads on every iteration.

f(x) = c * ln(x) * cos(x)

n=10000000


for (int pp = 2; pp<17; pp++)
{
    p = pp;
    int chunk = n/p; //acts like floor
    omp_set_num_threads(p);
    double start_parallel = omp_get_wtime();
    //start parallel
    #pragma omp parallel shared(tt,chunk) private (i)
    {
        //printf("thread number %d\n",omp_get_thread_num());
        #pragma omp for schedule(dynamic,chunk) nowait
        for(i=0; i<n; i++)
        {
            //tt[i] = f(tt[i]);
            tt[i] = f1(tt[i]); //the speed up is much higher with f1 since log and cos 
                               //computations are polynomial; see function.
        }
    } //end parallel
    double end_parallel = omp_get_wtime();
    double cpu_time_used_parallel = (double) (end_parallel - start_parallel);
    printf("parallel: for n=%d, p=%d, time taken=%f, speedup=%f\n",
            n,p,cpu_time_used_parallel,
            cpu_time_used_seq/cpu_time_used_parallel);
}
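
For reference, the sequential baseline cpu_time_used_seq used in the printf above might be measured like this ( a sketch under the assumption that the baseline is a plain loop over tt[]; it is not necessarily the poster's actual code ):

double start_seq = omp_get_wtime();
for (i = 0; i < n; i++)
{
    tt[i] = f1(tt[i]);    // same work as the parallel loop, done on one thread
}
double end_seq = omp_get_wtime();
double cpu_time_used_seq = end_seq - start_seq;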

Result:

Started varying threads:

parallel: for n=10000000, p=2, time taken=0.153774, speedup=3.503831
parallel: for n=10000000, p=3, time taken=0.064447, speedup=8.360370
parallel: for n=10000000, p=4, time taken=0.044694, speedup=12.055239
parallel: for n=10000000, p=5, time taken=0.048700, speedup=11.063550
parallel: for n=10000000, p=6, time taken=0.039009, speedup=13.811989
parallel: for n=10000000, p=7, time taken=0.041735, speedup=12.910017
parallel: for n=10000000, p=8, time taken=0.041268, speedup=13.055919
parallel: for n=10000000, p=9, time taken=0.039032, speedup=13.804157
parallel: for n=10000000, p=10, time taken=0.038970, speedup=13.825767
parallel: for n=10000000, p=11, time taken=0.039843, speedup=13.522884
parallel: for n=10000000, p=12, time taken=0.041356, speedup=13.028237
parallel: for n=10000000, p=13, time taken=0.041039, speedup=13.128763
parallel: for n=10000000, p=14, time taken=0.047433, speedup=11.359218
parallel: for n=10000000, p=15, time taken=0.048430, speedup=11.125202
parallel: for n=10000000, p=16, time taken=0.051950, speedup=10.371477


Note: The speedup here is computed against the sequential algorithm (threads = 1)

The speedup does not seem to be significantly affected by varying p (the number of threads).

Am I doing this right, or does the cause come from the inefficient way of incrementing the number of threads (i.e., theoretically speaking, changing p won't seriously affect O(myprogram))?

user3666197
Oussama Ben Ghorbel

2 Answers


Q : Am I doing this right...?
A : No, sorry, you do not happen to be doing this right. Let's analyse the reasons and sketch some hints for HPC-grade performance benchmarking together:

Well, you already know the initial benchmark design was not well crafted. The contemporary criticism of the original, overhead-naive Amdahl's law has more details on the effect of these add-on costs, and the timing details show it very well, as the thread-instantiation costs grew smaller and smaller relative to the other classes of processing- and I/O-related costs, as shown below.


Reasons?

The code has immense add-on overhead costs compared to the "computation"-part. This is the strongest marker of a flawed use of the technology capabilities of an otherwise legal syntax-constructor ( OpenMP here, map-reduce somewhere else, list-comprehension in other cases, or some other syntax-sugar tricks elsewhere ).

Ultimate performance is an Art of Balancing ( right-balancing the Costs and Benefits - any sort of imbalance means a lost performance edge ).


Costs? Here? Look at the landscape of results:

The second sin was to ignore the "behind-the-scenes" [TIME]-domain penalty of [SPACE]-domain scaling. The larger the n, the more memory had to be allocated and the more penalties came in from ( horrifying ) cache-line inefficiencies, all of that with zero protection against falling into the hell of memory swapping:

n=1E3        1E4        1E5        1E6        1E7        1E8        1E9       :
______________________________________________________________________________:_____
1.000      1.000      1.000      1.000      1.000      1.000      1.000       : p= 1
0.930      1.403      0.902      1.536      1.492      1.517      0.356       : p= 2
1.075      2.319      2.207      1.937      2.001      1.991      1.489++     : p= 3
1.497+++++ 2.636++++  1.563      1.657      2.571      2.144      0.687       : p= 4
1.226++    2.548+++   0.957      2.025      2.357      1.731      1.569++++   : p= 5
1.255+++   1.805      2.704      2.020      2.348      1.502      0.989       : p= 6
0.957      0.581      3.104++    2.124      2.486      2.002      0.838       : p= 7
1.151      1.376      2.449      2.154      2.573      1.536      0.776       : p= 8
1.135      1.685      2.388      2.506+++   2.852++++  2.311      1.676+++++  : p= 9
1.285++++  2.492++    2.497      2.568++++  2.647+     2.467      1.413+      : p=10
1.177      2.314+     2.709+     2.174      2.688+++   2.634++++  0.606       : p=11
1.216+     2.293      2.442      2.287      2.550      2.551++    1.256       : p=12
1.034      2.148      1.802      2.361++    2.635      2.554+++   1.181       : p=13
0.999      0.440      3.672+++++ 2.774+++++ 2.927+++++ 2.839+++++ 1.496+++    : p=14
1.091      1.217      3.285++++  2.284      2.525      2.356      1.005       : p=15
0.937      2.850+++++ 3.185+++   2.334+     2.655++    2.508+     0.889       : p=16

Tips for improved performance?

  • If benchmarking computation, benchmark the computation and avoid MEM-I/O

  • If benchmarking, always check the whole landscape, so as to get a feel for the in-cache side-effects ( better yet, learn how to avoid / evict any such effects from skewing the measured results )

  • Always avoid any form of sharing where it can be avoided: map the processing over disjunct areas of tt[], yet covering the "whole" tt[], in cache-line-coherent blocks, so as to avoid false-sharing and to "at least" re-use any data from the already fetched data-blocks, whose MEM-I/O fetching costs you have already paid ( if the vector-based LOAD/STORE is indeed a must - see above; a minimal sketch follows this list )

  • Permit and actively harness any HPC-vectorisation tricks available with FMA4 / SIMD / AVX512
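
A minimal sketch of the block-mapping idea from the list above, assuming the question's tt[] array and f1() function ( here only a prototype, the definition is assumed to exist elsewhere ); it is an illustration under those assumptions, not the poster's actual code:

#include <omp.h>
#include <stdio.h>

double f1( double x );   /* the question's function, assumed to be defined elsewhere */

void run_block_mapped( double *tt, int n, int p )
{
    omp_set_num_threads( p );
    double start = omp_get_wtime();

    /* schedule(static) with no chunk-size gives each thread one contiguous
       range of roughly n/p iterations: disjunct areas of tt[], yet covering
       the whole tt[], so threads do not share cache-lines ( except possibly
       at block boundaries ) and re-use the data-blocks they have fetched   */
    #pragma omp parallel for schedule(static)
    for ( int i = 0; i < n; i++ )
        tt[i] = f1( tt[i] );

    double end = omp_get_wtime();
    printf( "p=%d, time taken=%f\n", p, end - start );
}

The static, contiguous mapping also serves the first tip: the timed section then contains only the computation plus the unavoidable streaming reads/writes of tt[], with no dynamic-scheduling bookkeeping inside it.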

user3666197
  • This is an outstanding answer, plenty of good points over there. "If benchmarking computation, benchmark the computation and avoid MEM-I/O" - yes, I figured this out myself yesterday, thank you for pointing it out! I'm still new to OpenMP and parallel computing, so most of this does not make sense to me yet, to be honest. Still, thank you so much for enlightening me with such an amount of information. – Oussama Ben Ghorbel Oct 07 '19 at 08:57
  • Nice answer. There's one more thing which you missed. Starting a new thread team after changing the number of OpenMP threads is an expensive operation compared with starting a parallel region with the same number of threads as before. Since real codes do not (or, at least, should not) change the degree of parallelism, the benchmark should start an empty, untimed, warm-up parallel region before running the timed one (see the sketch after these comments). – Jim Cownie Oct 07 '19 at 09:09
  • @JimCownie Nice talk at the SC-18, cool to read about the MPAS-Ocean model ( btw. how well does it fit into the observed reality of metastable phenomena, like the fidelity of emulating the "long" time-scale ( compared to dt in a [TIME]-domain step of the simulation ) macro-scale ( a few thousands square kilometers ) re-configurations of the sea-flows between two, metastable states, named Kuroshio and Oyashio respectively? **If this level of detail was achieved in that high-fidelity modeling, the DoE teams deserve utmost admiration**, my hat is raised for supporting these efforts, Sir. – user3666197 Oct 07 '19 at 09:51
  • @user366197 For MPAS science details talk to the real scientists; I am just the computer scientist, language lawyer, and runtime implementor :-) – Jim Cownie Oct 08 '19 at 08:30
  • @JimCownie You got me, Sir, with serving a principally Finite State Automaton domain ( no matter how rich, yet finite ), as a "Syntax Lawyer" :o) All the best, Jim, and thank you for adding your remarks on how to avoid the add-on costs of omp-thread (re)instantiations from skewing the timed sections of the OpenMP code above. The OpenMP standard does not specify the underlying execution vehicle; that is exclusively an implementation issue. An OpenMP implementation may choose to provide this abstraction in any of multiple ways, i.e. one implementation may map an OpenMP thread onto an O/S process, another onto a Pthread. – user3666197 Oct 08 '19 at 08:42
  • @user3666197 If that takes you to Denver next month for SC19, I will be around and can be found via the OpenMP booth. (We're going way off topic here, but I don't see a way to send private messages...) – Jim Cownie Oct 08 '19 at 08:45
  • Guys, just one more detail: in the loop where I'm varying the number of threads, the first iteration is more or less decent benchmarking, the rest is WRONG. Why? Because values are being fetched from cache. To purge the cache I added a loop that puts random values back into the array, and the speedup went way down to 2~3. – Oussama Ben Ghorbel Oct 08 '19 at 09:53
  • Just one more detail: feel free to **open a *new* question - StackOverflow strongly discourages creeping the O/P definition further** by adding just one more and one more and another one and finally one last question. Contributing **Community members are always welcome** to post a new, **MCVE**-formulated problem with all relevant details about ***WHYTSF*** - **W**hat **H**ave **Y**ou **T**ried **S**o **F**ar, perhaps with a live-code session state URL on an IDE like https://tio.run/#c-gcc or https://godbolt.org/z/7Amgc9. Looking forward to seeing new inspirations of HPC in action. – user3666197 Oct 08 '19 at 11:05
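
A minimal sketch of the warm-up idea from Jim Cownie's comment above, again assuming the question's n, tt[] and f1() ( prototype only ); the function name is illustrative, not part of the original code:

#include <omp.h>

double f1( double x );   /* the question's function, assumed to be defined elsewhere */

double timed_run_with_warmup( double *tt, int n, int p )
{
    omp_set_num_threads( p );

    /* empty, untimed warm-up region: the cost of (re)creating the thread
       team for the new value of p is paid here, outside the timed section */
    #pragma omp parallel
    { }

    double start = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for ( int i = 0; i < n; i++ )
        tt[i] = f1( tt[i] );
    double end = omp_get_wtime();

    return end - start;
}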

Your results are not correct because of the nowait clause in the omp pragma. Normally, the end of a parallel region represents a synchronization point, but with nowait the first thread that finishes will not wait for the other threads, so you will execute double end_parallel = omp_get_wtime(); while other threads are still doing some computing. If you take out the nowait, you should observe completely different results.
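
For illustration, this is the question's loop with the nowait clause removed, as this answer suggests ( a sketch reusing the question's variable names, not a verified fix ):

    double start_parallel = omp_get_wtime();
    #pragma omp parallel shared(tt,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk)   /* nowait removed: all threads now
                                                     synchronize at the end of the loop */
        for(i=0; i<n; i++)
        {
            tt[i] = f1(tt[i]);
        }
    } //end parallel
    double end_parallel = omp_get_wtime();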

user3666197
coyotemk
  • I'm not sure about this; the synchronization happens at the end of the first directive, omp parallel. No thread other than the main thread reaches the end-time statement by itself; only the main thread gets there and computes the time? – Oussama Ben Ghorbel Oct 07 '19 at 08:54