Here is my attempt to benchmark the performance of the Intel TBB flow graph. The setup is as follows (a rough code sketch of it appears after the list):
- One broadcast node (a `broadcast_node<continue_msg>`) sending a `continue_msg` to `N` successor nodes.
- Each successor node performs a computation that takes `t` seconds.
- The total computation time when performed serially is `Tserial = N * t`.
- The ideal computation time, if all cores are used, is `Tpar(ideal) = N * t / C`, where `C` is the number of cores.
- The speed-up is defined as `Tserial / Tpar(actual)`.
- I tested the code with gcc 5 on a 16-core PC.
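A minimal sketch of this setup (not my actual benchmark code) would look roughly like the following; the node count `N`, the per-body time `t_us`, and the busy-wait body are placeholders chosen for illustration:

```cpp
// Sketch only: one broadcast_node fans a continue_msg out to N continue_node
// successors, each of which busy-waits for roughly t_us microseconds; the
// wall-clock time of one graph run is then compared against N * t.
#include <tbb/flow_graph.h>
#include <tbb/tick_count.h>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
    const int    N    = 1000;  // number of successor nodes (placeholder value)
    const double t_us = 10.0;  // work per body in microseconds (placeholder value)

    tbb::flow::graph g;
    tbb::flow::broadcast_node<tbb::flow::continue_msg> input(g);

    // Each successor node emulates the body work by spinning for ~t_us microseconds.
    std::vector<std::unique_ptr<tbb::flow::continue_node<tbb::flow::continue_msg>>> workers;
    for (int i = 0; i < N; ++i) {
        workers.emplace_back(new tbb::flow::continue_node<tbb::flow::continue_msg>(
            g, [t_us](tbb::flow::continue_msg) {
                const tbb::tick_count spin_start = tbb::tick_count::now();
                while ((tbb::tick_count::now() - spin_start).seconds() * 1e6 < t_us)
                    ;  // busy-wait to emulate the task body
                return tbb::flow::continue_msg();
            }));
        tbb::flow::make_edge(input, *workers.back());
    }

    const tbb::tick_count t0 = tbb::tick_count::now();
    input.try_put(tbb::flow::continue_msg());
    g.wait_for_all();
    const double t_par    = (tbb::tick_count::now() - t0).seconds();
    const double t_serial = N * t_us * 1e-6;

    std::printf("Tserial = %g s, Tpar(actual) = %g s, speed-up = %.2f\n",
                t_serial, t_par, t_serial / t_par);
    return 0;
}
```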
Here are the results, showing the speed-up as a function of the processing time of an individual task (i.e. the body):
- t = 100 microseconds, speed-up = 14
- t = 10 microseconds, speed-up = 7
- t = 1 microsecond, speed-up = 1
As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code.
Here are my questions:
1) Are these results in line with Intel TBB benchmarks?
2) Is there a better paradigm than a flow graph for the case where there are thousands of tasks, each taking less than 1 microsecond?