I'm currently working on parallelizing a C++ program to improve its performance on multi-core systems. Using OpenMP, and after working through the usual challenges (thread synchronization, shared data access, etc.), we finally managed to parallelize the entire program, but the performance gain is smaller than expected.
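For context, the parallel sections essentially follow this pattern (a simplified, hypothetical sketch; `process_block` and the loop body stand in for the real functions, which do considerably more work per iteration):

```cpp
#include <omp.h>
#include <vector>

// Hypothetical stand-in for one of the hot functions that was parallelized;
// the real code does much more work per iteration.
void process_block(std::vector<double>& data) {
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        data[i] = data[i] * data[i] + 1.0;  // placeholder computation
    }
}
```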
Using Intel VTune Amplifier, I ran a hotspot analysis and found that in almost every function that should run in parallel, "start_thread clone" from libgomp.so takes more time than the actual execution of the function:
This is really unexpected, since I had verified that, on current OpenMP implementations, there should be almost no penalty for switching between parallel and serial regions. According to this discussion:
The threads are started when your program starts (or the first time they are needed, depending on the implementation). Pause your program anywhere else, and you'll notice the threads are still there.
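Instead of pausing in a debugger, the live thread count can also be read programmatically; here is a minimal Linux-specific sketch (it assumes /proc is available and is not part of my actual program):

```cpp
#include <omp.h>
#include <cstdio>
#include <fstream>
#include <string>

// Read the "Threads:" line from /proc/self/status (Linux only).
static int live_threads() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("Threads:", 0) == 0)
            return std::stoi(line.substr(8));
    }
    return -1;  // field not found
}

int main() {
    std::printf("before first parallel region: %d threads\n", live_threads());
    #pragma omp parallel
    { /* empty region, just forces the thread team to be created */ }
    std::printf("after first parallel region:  %d threads\n", live_threads());
    return 0;
}
```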
In my case I checked with the debugger: before the first parallel region there was only one thread; afterwards, wherever I stopped (parallel or serial region), there were multiple threads. So I was convinced that there should be no overhead from respawning new threads each time.
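To quantify what entering a parallel region actually costs per call, one can also time consecutive near-empty regions; a minimal sketch (the trivial region body is only there so the fork/join itself dominates the measurement):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    double sink = 0.0;
    for (int rep = 0; rep < 5; ++rep) {
        double t0 = omp_get_wtime();
        #pragma omp parallel reduction(+ : sink)
        {
            sink += omp_get_thread_num();  // trivial body; we time fork/join
        }
        double t1 = omp_get_wtime();
        // If threads are reused, only rep 0 should include thread creation;
        // later reps should only pay the wake-up/join cost.
        std::printf("region %d: %.3f us\n", rep, (t1 - t0) * 1e6);
    }
    return static_cast<int>(sink);  // keep the work from being optimized away
}
```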
Now VTune seems to tell me otherwise, as far as I can interpret the measurements. Can somebody help me understand this?