Why does Intel Threading Building Blocks (TBB) parallel_for have such a large overhead? According to section 3.2.2, "Automatic Chunking", in Tutorial.pdf, it is around half a millisecond. This is an excerpt from the tutorial:
CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
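To check that I am reading the numbers correctly, here is a minimal sketch of the arithmetic behind that example. The helper name cycles_for is my own (hypothetical, not part of TBB), and it assumes the clock frequency stays constant for the whole loop:

```cpp
#include <cstdint>

// Hypothetical helper (not a TBB function): converts a loop's duration
// into clock cycles, assuming a fixed clock frequency.
// 1 MHz equals 1 cycle per microsecond, so cycles = microseconds * MHz.
constexpr std::uint64_t cycles_for(std::uint64_t microseconds,
                                   std::uint64_t frequency_mhz) {
    return microseconds * frequency_mhz;
}

// The tutorial's example: 500 microseconds on a 2 GHz (2000 MHz) processor.
static_assert(cycles_for(500, 2000) == 1000000,
              "500 us at 2 GHz is one million clock cycles");
```

So the "million clock cycles" and the "500 microseconds on a 2 GHz processor" in the quote are the same threshold stated in two ways.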
From what I have read so far, TBB uses the thread pool pattern (a pool of worker threads) internally, and it avoids this kind of overhead by spawning the worker threads only once, at initialization (a one-time cost of hundreds of microseconds).
So what is taking the time? Data synchronization using mutexes isn't that slow, right? Besides, doesn't TBB use lock-free data structures for synchronization?