I'm currently working on parallelizing a C++ program to improve its performance on multi-core systems. Using OpenMP, and after working through the usual challenges (thread synchronization, shared data access, etc.), we finally managed to parallelize the entire program, but the performance gain is smaller than expected.
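For context, the parallel sections essentially follow this pattern (a simplified, hypothetical sketch; `process_block` and the loop body stand in for the real functions, which do considerably more work per iteration):

```cpp
#include <omp.h>
#include <vector>

// Hypothetical stand-in for one of the hot functions that was parallelized;
// the real code does much more work per iteration.
void process_block(std::vector<double>& data) {
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        data[i] = data[i] * data[i] + 1.0;  // placeholder computation
    }
}
```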
Using Intel VTune Amplifier, I ran a hotspot analysis and found that in almost every function that should run in parallel, "start_thread clone" from libgomp.so takes more time than the actual execution of the function:
This is really unexpected, since I had verified that, on current OpenMP implementations, there should be almost no penalty for switching between parallel and serial regions. According to this discussion:
The threads are started when your program starts (or the first time they are needed, depending on the implementation). Pause your program anywhere else, and you'll notice the threads are still there.
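Instead of pausing in a debugger, the live thread count can also be read programmatically; here is a minimal Linux-specific sketch (it assumes /proc is available and is not part of my actual program):

```cpp
#include <omp.h>
#include <cstdio>
#include <fstream>
#include <string>

// Read the "Threads:" line from /proc/self/status (Linux only).
static int live_threads() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("Threads:", 0) == 0)
            return std::stoi(line.substr(8));
    }
    return -1;  // field not found
}

int main() {
    std::printf("before first parallel region: %d threads\n", live_threads());
    #pragma omp parallel
    { /* empty region, just forces the thread team to be created */ }
    std::printf("after first parallel region:  %d threads\n", live_threads());
    return 0;
}
```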
In my case I checked with the debugger: before the first parallel region there was only one thread; afterwards, wherever I stopped (parallel or serial region), there were multiple threads. So I was convinced that there should be no overhead from respawning new threads each time.
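To quantify what entering a parallel region actually costs per call, one can also time consecutive near-empty regions; a minimal sketch (the trivial region body is only there so the fork/join itself dominates the measurement):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    double sink = 0.0;
    for (int rep = 0; rep < 5; ++rep) {
        double t0 = omp_get_wtime();
        #pragma omp parallel reduction(+ : sink)
        {
            sink += omp_get_thread_num();  // trivial body; we time fork/join
        }
        double t1 = omp_get_wtime();
        // If threads are reused, only rep 0 should include thread creation;
        // later reps should only pay the wake-up/join cost.
        std::printf("region %d: %.3f us\n", rep, (t1 - t0) * 1e6);
    }
    return static_cast<int>(sink);  // keep the work from being optimized away
}
```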
Now VTune seems to tell me otherwise, as far as I can interpret the measurements. Can somebody help me understand this?