
I've implemented TBB parallel_for as shown below on a laptop with a 4-core (8-thread) i7-3920XM. The calculation takes about 15s, and CPU usage is about 70% on each core. If I initialize a fixed number of threads for the job, e.g. g_nthreads = 4, it takes 12s. That's fine, and I'm satisfied with the default behavior.

// g_nthreads = 8, on 4-core(8 threads) laptop
int g_nthreads = tbb::task_scheduler_init::default_num_threads(); 
tbb::task_scheduler_init tbb_init(g_nthreads); 

...

// k iterates from 0 to N-1
tbb::parallel_for(0, N, [tool, pos, rock,...] (int k) {
    ...
    Func(tool, pos, k, rock, ...);
});

The problem is that when I run the same code on a 20-core (40-thread) Xeon E5-2680 workstation, where TBB automatically initializes 40 threads for the job, the run time degrades sharply to 30s. Overall CPU usage in this case is 12%, and only half of the cores are shown as active. When I fix the number of threads to 4 again, it takes 13s, but overall CPU usage is still ~12%.

It looks like the job doesn't need that many threads on a 20-core (40-thread) computer, and the overhead of dividing the job into 40 tasks is dominating the run time.
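One way to read the utilization figures above is to convert them into an equivalent number of fully busy logical CPUs. The helper below is a hypothetical illustration (the name BusyLogicalCpus is not from the original code) and only does the multiplication:

```cpp
// Hypothetical helper: converts an overall CPU utilization fraction into the
// equivalent number of fully busy logical CPUs.
inline double BusyLogicalCpus(double utilization, int logical_cpus) {
    return utilization * logical_cpus;
}
```

With the numbers reported above, 70% of 8 logical CPUs is roughly 5.6 busy CPUs on the laptop, and 12% of 40 is roughly 4.8 on the workstation. If those readings are accurate, both machines keep about five logical CPUs busy, which may point to something serializing the work rather than to task-splitting overhead alone.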

How can I maximize CPU usage and improve performance on a 20-core (40-thread) computer for this job? Func() is an EM function that requires a lot of calculation.

Updated:

I finally got the program to 100% CPU usage on the 20-core (40-thread) computer. Two changes were needed:

1. The program was pinning threads to specific cores, which is normally not the right approach. It's better to let TBB itself decide how threads move between cores.

2. Force the program to use tbbmalloc.dll and tbbmalloc_proxy.dll. This can be set in VS: project property -- C/C++ -- Advanced -- Forced Include File --> "tbb/tbbmalloc_proxy.h". These two DLLs handle TBB memory management.

yfeng
  • It's not clear how big Func() is. Can you profile it to say how many milli- or nanoseconds it takes? Does it contain any locks? How big is N? – Anton Mar 30 '14 at 15:09
  • Easy way to test if the problem is TBB or Func: replace Func with something trivial (and demonstrably scalable), then re-run your test. If the problem persists then it's TBB, but I'll bet that it isn't. TBB works very well on Intel processors. – blockchaindev Mar 31 '14 at 15:28
  • Thank you for your advice. Func is big and has no locks in it. I finally found the cause: performance was not maximized because I was pinning the threads to cores, among other things. Details are in the edit to my post. – yfeng Jun 17 '14 at 14:54
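The test suggested in the comments (swap Func for something trivial and demonstrably scalable, then re-time) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's code: TrivialFunc, RunStandIn and TimeStandIn are hypothetical names, and the loop is split across plain std::thread workers instead of tbb::parallel_for so that the example stays self-contained. In the real test, TrivialFunc would simply take the place of Func inside the existing tbb::parallel_for call.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for Func: pure arithmetic, no allocation, no locks,
// no shared state, so it should scale close to linearly with thread count.
inline std::uint64_t TrivialFunc(int k) {
    std::uint64_t x = static_cast<std::uint64_t>(k) + 1;
    for (int i = 0; i < 1000; ++i) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return x;
}

// Runs TrivialFunc(k) for k in [0, n) on nthreads threads, one contiguous
// chunk per thread, and returns the XOR of all results.
inline std::uint64_t RunStandIn(int n, unsigned nthreads) {
    nthreads = std::max(1u, nthreads);
    std::vector<std::uint64_t> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&partial, n, nthreads, t] {
            const int begin = static_cast<int>(static_cast<long long>(n) * t / nthreads);
            const int end = static_cast<int>(static_cast<long long>(n) * (t + 1) / nthreads);
            std::uint64_t acc = 0;
            for (int k = begin; k < end; ++k) acc ^= TrivialFunc(k);
            partial[t] = acc;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total ^= p;
    return total;
}

// Wall-clock seconds for one run at the given thread count.
inline double TimeStandIn(int n, unsigned nthreads) {
    const auto start = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = RunStandIn(n, nthreads);
    (void)sink;  // keep the result alive so the work is not optimized away
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling TimeStandIn(N, t) for t = 1, 2, 4, ... up to the number of logical CPUs gives a baseline scaling curve. If the stand-in keeps getting faster where the real Func does not, the bottleneck is more likely inside Func or its environment (for example memory allocation or thread pinning) than in the scheduler.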
