Intel TBB Parallelization Overhead

Question

Why does Intel Threading Building Blocks (TBB) parallel_for have such a large overhead? According to section 3.2.2, Automatic Chunking, in Tutorial.pdf, it's around half a millisecond. This is an excerpt from the tutorial:

CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
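The two figures in that caution are the same number in different units: one million cycles at 2 GHz is 500 microseconds. A minimal sketch of the conversion, where the function name is purely illustrative and not part of TBB:

```cpp
#include <cassert>
#include <cstdint>

// Convert a cycle count to microseconds at a given clock frequency.
// The name cycles_to_microseconds is hypothetical, chosen for illustration.
double cycles_to_microseconds(std::uint64_t cycles, double clock_hz) {
    assert(clock_hz > 0.0);  // a zero or negative frequency makes no sense
    // Multiply before dividing: cycles * 1e6 / Hz gives microseconds.
    return static_cast<double>(cycles) * 1e6 / clock_hz;
}
```

Calling cycles_to_microseconds(1000000, 2e9) reproduces the tutorial's threshold for a 2 GHz processor; on a faster clock the same cycle budget corresponds to less wall-clock time.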

From what I have read so far TBB uses the threadpool (pool of worker threads) pattern internally and it prevents such bad overheads by only spawning worker threads once initially (which costs hundreds of microseconds).

So what is taking the time? Data synchronization using mutexes isn't that slow, right? Besides, doesn't TBB make use of lock-free data structures for synchronization?

minjang · Accepted Answer · 2011-07-23T00:58:57.293

From what I have read so far TBB uses the threadpool (pool of worker threads) pattern internally and it prevents such bad overheads by only spawning worker threads once initially (which costs hundreds of microseconds).

Yes, TBB pre-allocates threads. It doesn't physically create and join worker threads each time it encounters a parallel_for. OpenMP implementations and other parallel libraries generally pre-allocate threads too.

But there is still overhead in waking threads from the pool and dispatching logical tasks to them. Yes, TBB exploits lock-free data structures to minimize overhead, but some parallel overhead (i.e., a serial part) always remains. That's why the TBB manual advises avoiding very short loops.

In general, you need a sufficient amount of work to gain parallel speedup. I think even 1 millisecond (= 1,000 microseconds) is too small. In my experience, to see meaningful speedup I needed to increase the execution time to around 100 milliseconds.
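One way to act on this is to time the serial loop first and only consider parallelizing it when it runs long enough. The sketch below uses only the standard library; both function names and the default 500 microsecond threshold (taken from the tutorial's caution) are assumptions for illustration, not anything TBB provides:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Run body(i) serially for i in [0, n) and return the elapsed time
// in microseconds. Hypothetical helper, not part of TBB.
template <typename Body>
double serial_loop_microseconds(std::size_t n, Body body) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        body(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

// Rough rule of thumb: parallelize only if the serial loop takes at
// least threshold_us. The default is an assumption; adjust to taste.
bool worth_parallelizing(double serial_us, double threshold_us = 500.0) {
    assert(threshold_us > 0.0);
    return serial_us >= threshold_us;
}
```

Passing a larger threshold, for example worth_parallelizing(t, 100000.0) for 100 milliseconds, matches the more conservative experience described above. A single timing is noisy, so in practice it is worth repeating the measurement a few times.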

If the parallel overhead of TBB parallel_for is really a concern to you, it might be worth trying simple static scheduling. I don't have good knowledge of TBB's static scheduling implementation, but you can easily try OpenMP's: #pragma omp parallel for schedule(static). I believe its overhead is close to the minimal cost of a parallel for. However, because the scheduling is static, the benefit of dynamic scheduling (especially when workloads are not homogeneous) is lost.
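A minimal sketch of the OpenMP variant, assuming a compiler with OpenMP support (for example -fopenmp on GCC or Clang). The function add_arrays is a made-up example, and the pragma is guarded so the code still compiles and runs serially when OpenMP is disabled:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise sum of two equally sized arrays. With OpenMP enabled,
// schedule(static) splits the iterations into contiguous chunks, one per
// thread, fixed up front with no work stealing or rebalancing.
void add_arrays(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
    // A signed loop index keeps this valid for older OpenMP versions.
    const long long n = static_cast<long long>(a.size());
#if defined(_OPENMP)
#pragma omp parallel for schedule(static)
#endif
    for (long long i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

This suits loops like the one here, where every iteration costs about the same. If iteration costs vary widely, the fixed chunks can leave some threads idle while others finish, which is exactly the case where a dynamic scheduler pays for its overhead.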

Thanks! TBB seems great and well designed! However, I am a bit unsure how to interpret the first two sentences of your otherwise great answer: 1. Do OpenMP and other parallel libraries pre-allocate threads or not? 2. What are the main differences compared to OpenMP and libstdc++ parallel mode? Can you recommend a web page that compares them? — Nordlöw, Jul 22 '11 at 22:26
1. Yes, they mostly pre-allocate. It's actually implementation-specific, though. — minjang, Jul 22 '11 at 22:51
2. Does libstdc++ parallel mode mean pthread? OpenMP, which is implemented as a middle-end in a compiler (as opposed to the pure C++ library approach of TBB), offers a simple way to achieve a parallel for with small parallel overhead. But it doesn't offer a good dynamic scheduler such as the work-stealing schedulers in Cilk and TBB. — minjang, Jul 22 '11 at 22:52
