In my C++ program, I use the Boost libraries for multithreading. In one part of the program, several threads call join() on other threads.

The program runs quite slowly for some inputs. To improve it, I looked for hotspots with Intel VTune. The most time-consuming hotspot is reported in boost::this_thread::interruptible_wait:

[Screenshot: VTune hotspots result]

When I checked the part of the source code where this hotspot occurs, it pointed to the call to join(). I was under the impression that waiting threads do not consume CPU time. Can someone help me understand why the thread join() operation takes up so much CPU time?

Any insights on how to fix such a hotspot will be very helpful too! One way I can think of to fix such a hotspot would be to somehow detach() the threads and not join() them.

Thanks in advance!

progammer

1 Answer

> I was under the impression that waiting threads do not take CPU Time

It really depends on how the threads wait. They may be busy waiting (i.e. spinning) to react as quickly as possible to whatever they are waiting for. The alternative of yielding execution after every check means potentially higher delays from operating system scheduling (and thread switching overhead).
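
The difference between the two waiting styles can be made visible with a small experiment. The following is a minimal sketch, not Boost's actual implementation: it uses std::thread as a stand-in for boost::thread, the function names are made up for illustration, and it assumes a POSIX system where std::clock() reports CPU time summed over all threads of the process (on Windows it reports wall-clock time, so the comparison would not work there).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <ctime>
#include <mutex>
#include <thread>

// Process CPU time in seconds (assumption: POSIX semantics of std::clock()).
static double cpu_seconds() {
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}

// Hypothetical helper: the waiter polls an atomic flag in a tight loop.
// Returns the CPU time the process consumed while waiting.
double cpu_used_by_spin_wait(std::chrono::milliseconds work) {
    std::atomic<bool> done{false};
    const double start = cpu_seconds();
    std::thread worker([&] {
        std::this_thread::sleep_for(work);  // stand-in for real work; uses no CPU
        done.store(true, std::memory_order_release);
    });
    while (!done.load(std::memory_order_acquire)) {
        // busy-wait: this loop keeps one core occupied until the flag flips
    }
    worker.join();
    return cpu_seconds() - start;
}

// Hypothetical helper: the waiter blocks on a condition variable instead.
// Returns the CPU time the process consumed while waiting.
double cpu_used_by_blocking_wait(std::chrono::milliseconds work) {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    const double start = cpu_seconds();
    std::thread worker([&] {
        std::this_thread::sleep_for(work);
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
    });
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return done; });  // waiter is descheduled until notified
        assert(done);
    }
    worker.join();
    return cpu_seconds() - start;
}
```

Under these assumptions, the spinning variant should report CPU time close to the length of the wait, while the blocking variant should report close to none. Which of the two a given library uses for join() is something to check in its source or documentation rather than infer from this sketch.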

VTune will mercilessly pick up on all of your threading library overhead. You will need to filter appropriately to figure out where your serial hotspots are and whether your parallelization has mitigated them.

If your threads spend a significant amount of time waiting on the join, your parallel section is probably not well-balanced. Without more information on your problem it's hard to tell what the reason is or how to mitigate it, but you should probably try to distribute the work more evenly.
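
One common way to even out the work, when the items can be processed independently, is to let each thread claim the next item from a shared counter instead of assigning fixed chunks up front. This is only a sketch under that assumption; the name parallel_for_dynamic is hypothetical, and std::thread again stands in for boost::thread.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: applies `task` to every index in [0, n_items)
// using `n_threads` workers (assumed to be at least 1).
// Each worker claims the next unprocessed index from a shared atomic
// counter, so a thread that draws cheap items simply claims more of them
// instead of finishing early and leaving others to be waited on at join().
void parallel_for_dynamic(std::size_t n_items, unsigned n_threads,
                          const std::function<void(std::size_t)>& task) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                // fetch_add hands out each index to exactly one worker
                const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                if (i >= n_items) break;
                task(i);
            }
        });
    }
    for (auto& w : workers) w.join();  // all workers finish at about the same time
}
```

The trade-off is one atomic operation per item, so this fits best when each item is reasonably expensive compared with that overhead.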

On another note, the recent Spectre/Meltdown fixes appear to have increased VTune's profiling overhead. I would be careful about taking the results at face value (does your program run for close to the same amount of time with and without profiling?).
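
A simple way to check this is to time the region of interest yourself and compare a normal run with a run under the profiler. The helper below is a minimal sketch; the name wall_seconds is made up for illustration.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: returns the wall-clock seconds that `fn` takes to run.
// steady_clock is used because it never jumps backwards.
double wall_seconds(const std::function<void()>& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the parallel section in wall_seconds and printing the result in both runs gives two numbers to compare; a large gap suggests the profile is dominated by profiling overhead.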

Edit: Related material here and here. Following the instructions in the linked page for disabling the kernel protections helped in my case, although I have not tested it on the latest VTune update.

Max Langhof
  • Thanks for the answer! At the moment, I am not very concerned about balancing my parallel section, because the problem at hand is difficult to parallelize with a well-balanced thread workload. My main concern is whether the join() function in the Boost libraries busy-waits or not. Without profiling, it runs around 50% faster for the input that I tested. – progammer Apr 19 '18 at 13:30
  • Yeah, that sounds familiar. Basic Hotspots has never worked properly for me with OMP-parallelized code (over several updates) by virtually doing no parallel work, but I don't know how common that is. Have you tried Advanced Hotspots, General Exploration or the Visual Studio Performance Analyzer? They use different sampling techniques and (in my experience) can give better results - but between the Windows and VTune updates it's hard to give a general statement. – Max Langhof Apr 19 '18 at 13:39
  • I have tried Advanced Hotspots before. So, are the results of my "Basic Hotspots" incorrect? That is, does the join() function take no CPU time? – progammer Apr 19 '18 at 13:55
  • The results are not worth anything if your program runs 50% longer while profiled - either because you are measuring the actual profiling overhead or because Basic Hotspots is just not working correctly. However, that doesn't mean `join()` takes no CPU time. Did Advanced Hotspots work when you tried it? Also check the edit in my answer. – Max Langhof Apr 19 '18 at 14:14
  • In your related material, Intel recommends against using "Advanced Hotspots". They also say that "Basic Hotspots" works correctly. Yes, I remember that "Advanced Hotspots" sometimes showed me a result and crashed a few other times. – progammer Apr 19 '18 at 14:27
  • I am just giving you my experiences, which include Basic Hotspots failing miserably on parallel code. That article starts with an update stating that Update 2 fixes some of the issues. In particular, Advanced Hotspots appears to no longer crash, but it still has absolutely horrendous overhead in my experience. However, before and after Update 2, Advanced Hotspots remained usable by disabling the kernel protections. If you can live without stack sampling (which is one of the main reasons for the overhead), try General Exploration. – Max Langhof Apr 19 '18 at 14:43