Here is my attempt to benchmark the performance of the Intel TBB flow graph. The setup is as follows (a rough code sketch of it appears after the list):
- One broadcast node (a `broadcast_node<continue_msg>`) sending a `continue_msg` to `N` successor nodes.
- Each successor node performs a computation that takes `t` seconds.
- The total computation time when performed serially is `Tserial = N * t`.
- The ideal computation time, if all cores are used, is `Tpar(ideal) = N * t / C`, where `C` is the number of cores.
- The speed-up is defined as `Tserial / Tpar(actual)`.
- I tested the code with gcc 5 on a 16-core PC.
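A minimal sketch of this setup (not my actual benchmark code) would look roughly like the following; the node count `N`, the per-body time `t_us`, and the busy-wait body are placeholders chosen for illustration:

```cpp
// Sketch only: one broadcast_node fans a continue_msg out to N continue_node
// successors, each of which busy-waits for roughly t_us microseconds; the
// wall-clock time of one graph run is then compared against N * t.
#include <tbb/flow_graph.h>
#include <tbb/tick_count.h>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
    const int    N    = 1000;  // number of successor nodes (placeholder value)
    const double t_us = 10.0;  // work per body in microseconds (placeholder value)

    tbb::flow::graph g;
    tbb::flow::broadcast_node<tbb::flow::continue_msg> input(g);

    // Each successor node emulates the body work by spinning for ~t_us microseconds.
    std::vector<std::unique_ptr<tbb::flow::continue_node<tbb::flow::continue_msg>>> workers;
    for (int i = 0; i < N; ++i) {
        workers.emplace_back(new tbb::flow::continue_node<tbb::flow::continue_msg>(
            g, [t_us](tbb::flow::continue_msg) {
                const tbb::tick_count spin_start = tbb::tick_count::now();
                while ((tbb::tick_count::now() - spin_start).seconds() * 1e6 < t_us)
                    ;  // busy-wait to emulate the task body
                return tbb::flow::continue_msg();
            }));
        tbb::flow::make_edge(input, *workers.back());
    }

    const tbb::tick_count t0 = tbb::tick_count::now();
    input.try_put(tbb::flow::continue_msg());
    g.wait_for_all();
    const double t_par    = (tbb::tick_count::now() - t0).seconds();
    const double t_serial = N * t_us * 1e-6;

    std::printf("Tserial = %g s, Tpar(actual) = %g s, speed-up = %.2f\n",
                t_serial, t_par, t_serial / t_par);
    return 0;
}
```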
Here are the results, showing the speed-up as a function of the processing time of an individual task (i.e. the body):
- t = 100 microseconds, speed-up = 14
- t = 10 microseconds, speed-up = 7
- t = 1 microsecond, speed-up = 1
As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code.
Here are my questions:
1) Are these results in line with Intel TBB benchmarks?
2) Is there a better paradigm than a flow graph for the case where there are thousands of tasks, each taking less than 1 microsecond?