Q : How do I measure the actual computing time of each thread in such a setting?
A trivial MOCK-UP CODE for simple, semi-manual code-execution time profiling:

Needless to say, on a "noisy" execution platform the choice of CLOCK_MONOTONIC
avoids drifting time updates, yet it does not "save" you from off-CPU-core wait-states, caused by any ( the more so, if heavy ) "background"-(disturbing)-processes scheduled by the O/S.
Yet, for a prototyping phase, this is way easier than mounting all the "omp-native" callbacks'
{ ompt_callback_task_create_t, ompt_callback_task_schedule_t, ompt_callback_task_dependence_t, ompt_callback_dispatch_t, ompt_callback_sync_region_t, ..., ompt_callback_thread_begin_t, ompt_callback_thread_end_t, ... }
handling.
A SIDE-EFFECT BONUS :
If the recorded, nested, per-level code-execution durations are reported and post-processed, this trivial code also permits one to "frame" the hidden costs of the associated call-signatures' and recursion-nesting related overheads.
The revised, overhead-strict Amdahl's Law then stops lying to you and starts to show you, more precisely, when this code starts to lose on those very overhead-related ( plus those due to a potential atomicity-of-work-unit(s) ) principally-[SERIAL] add-on costs, against any expected True-[PARALLEL] section(s) speedup ( expected from harnessing more ( those, and only those, otherwise free ) resources ).
That is always the hardest part of the War ( still to be fought ahead ... ).
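One common overhead-aware re-formulation can be sketched as follows ( the symbols are labels introduced here, not from the original: p is the parallelisable fraction of the work, N the count of harnessed resources, and o_S the sum of the add-on setup & termination overheads, which are principally-[SERIAL] ):

```latex
S(N) \;=\; \frac{ 1 }{ \;(1 - p) \;+\; o_S \;+\; \dfrac{p}{N}\; }
```

As N grows, the o_S term stays, so the achievable speedup saturates at 1 / ( (1 - p) + o_S ), however many resources get thrown at the True-[PARALLEL] part.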
EFFICIENCY of SCHEDULING & OCCUPIED RESOURCES' of a CALL to 2-ary task-SCHEDULED fun() with hidden 1-ary RECURSION:
CALL
42----*--------------------------------------------------------------------------------------*
: | |
: | 21----*---------------------------------------*
: | : | |
: | : | 10----*----------------*
: | : | : | |
: | : | : | 5----*----*
: | : | : | : | |
: | : | : | : | 2<
: | : | : | : 2< /
: | : | : 5----*----* 5___/___/................ #taskwait 2
: | : | : : | | /
: | : | : : | 2< /
: | : | : : 2< / /
: | : | : 5___/___/ /
: | : | 10___/____________/............................. #taskwait 5
: | : 10----*----------------* /
: | : : | | /
: | : : | 5----*----* /
: | : : | : | | /
: | : : | : | 2< /
: | : : | : 2< / /
: | : : 5----*----* 5___/___/ /
: | : : : | | / /
: | : : : | 2< / /
: | : : : 2< / / /
: | : : 5___/___/ / /
: | : 10___/____________/__________/.......................................................... #taskwait 10
: | 21___/
: 21----*---------------------------------------* /
: : | | /
: : | 10----*----------------* /
: : | : | | /
: : | : | 5----*----* /
: : | : | : | | /
: : | : | : | 2< /
: : | : | : 2< / /
: : | : 5----*----* 5___/___/ /
: : | : : | | / /
: : | : : | 2< / /
: : | : : 2< / / /
: : | : 5___/___/ / /
: : | 10___/____________/ /
: : 10----*----------------* / /
: : : | | / /
: : : | 5----*----* / /
: : : | : | | / /
: : : | : | 2< / /
: : : | : 2< / / /
: : : 5----*----* 5___/___/ / /
: : : : | | / / /
: : : : | 2< / / /
: : : : 2< / / / /
: : : 5___/___/ / / /
: : 10___/____________/__________/ /
: 21___/_______________________________________________________/...................................................................................................................... #taskwait 21
42___/
RET_/
EFFICIENCY of SCHEDULING & OCCUPIED RESOURCES' of a CALL to a 2-ary task-SCHEDULED fun() with a hidden 1-ary RECURSION matters more and more for any growing scale of the workload. Soon, workload < someBound * 2 ^ W holds only at the cost of an awfully high W ( which causes W * k-ary-many times re-{-acquired, -allocated, -released }, Wasted, all-k-ary-times-requested #pragma omp task shared(...) handling-related resources, throughout the progression of the whole pure-[SERIAL]-by-definition recursion dive-&-resurfacing back ).
It is easy to see how many resources will hang waiting ( due to even a 1-ary RECURSION formulation ), until each and every dive to the deepest level of the recursion bubbles back to the top-level #pragma omp taskwait.
The costs of re-allocating new and new resources for each recursion-dive level will, performance-wise, most often kill you on the overhead-strict Amdahl's Law, if not driving you into thrashing or a system-configuration-related overflow first, by devastating the real-system physical resources, for any reasonably large recursion depth.
These are costs you need not have paid, had you not used a "typomatically-cheap"-yet-expensive-in-(idle/wasted)-resources recursive problem-formulation, even in the most lightweight 1-ary cases.
See how many ":"-denoted "waiting-lines" there are in parallel, besides how few "|"-denoted "computing-lines", in either phase of the topology. These waste, yet must keep idle, all task-related resources ( memory & stack-space are just the more visible ones, which are performance-wise very expensive to acquire ( just to let them sit idle, waiting, for most of the processing time ), or prone to crash, due to overflow, if over-subscribed beyond the real system's configuration capacities ).
The War is yours! Keep Walking ...
The Site-Compliance Disclaimer :
------------------------------------------------------------------------------
As per StackOverflow policy, the full mock-up code is posted here, in case the Godbolt.org platform should ever turn inaccessible. Otherwise, feel free to prefer and/or use the Compiler Explorer tools, which go way, way beyond the sequence of chars put into the mock-up source-code there.
The choice & the joy from executing it is always yours :o)
#include <stdio.h> // _____________________________________________________ for duration reporting
#include <time.h>

int estimateWorkload() {
    return 42; // _________________________________________________________ mock-up "workload"
}

int serial_a_bit_less_naive_factorial( int n ){ // ________________________ mock-up CPU burner ( the int overflow is harmless here )
    return ( n < 3 ) ? n : n * serial_a_bit_less_naive_factorial( n - 1 );
}

int serialFunction() {
    return serial_a_bit_less_naive_factorial( 76 );
}

int parallelFunction( int workload, const int someBound ) { // ____________ pass both control parameters
    struct timespec T0, T1;
    int             retValue,
                    result1,
                    result2;
    clock_gettime( CLOCK_MONOTONIC, &T0 ); // \/\/\/\/\/\/\/\/\/\ SECTION.begin
    if ( workload < someBound ) {
         retValue = serialFunction();
    }
    else { // -- [SEQ]----------------------------------------------------
        #pragma omp task shared( result1 ) // -- [PAR]|||||||||||||||||||| with (1-ary recursions)
        {
            result1 = parallelFunction( workload / 2, someBound );
        }
        #pragma omp task shared( result2 ) // -- [PAR]|||||||||||||||||||| with (1-ary recursions)
        {
            result2 = parallelFunction( workload / 2, someBound );
        }
        #pragma omp taskwait
        retValue = result1 < result2; // __________________________________ mock-up "reduction"
    }
    clock_gettime( CLOCK_MONOTONIC, &T1 ); // \/\/\/\/\/\/\/\/\/\ SECTION.end
    printf( "workload %6d took %15lld [ns]\n", // _________________________ report the recorded duration ( lines may interleave across threads )
            workload,
            ( (long long)( T1.tv_sec  - T0.tv_sec  ) * 1000000000LL
            + (long long)( T1.tv_nsec - T0.tv_nsec ) ) );
    // ____________________________________________________________________ MAY ADD ACCUMULATION (1-ary recursions)
    return retValue;
}
int main() {
    const int someBound = 3; // _______________________________________ a control parameter A
    #pragma omp parallel
    {
        #pragma omp master
        {
            // -- [SEQ]---------------------------------------- do some serial stuff
            // ------------------------------estimate if parallelisation is worth it
            const int workload = estimateWorkload();
            if ( workload < someBound ) {
                 serialFunction();
            }
            else {
                 parallelFunction( workload, someBound ); // -- [PAR]||||||| with (1-ary recursions)
            }
        }
    }
}