
I am currently profiling a C++ project with Intel VTune to find its hot spots and measure its threading efficiency. Run normally, the program takes about 15 minutes. The hotspot analysis in VTune shows that the function __kmp_fork_barrier accounts for roughly 40% of the total CPU time.
I therefore also wanted to look at the threading efficiency. However, when I start the threading analysis in VTune in hardware event-based sampling mode, the program does not make progress and instead hangs in __kmp_acquire_ticket_lock. In user-mode sampling mode, the program immediately fails with a segfault (which does not occur when running it without VTune or when checking it with valgrind). With the HPC performance characterization analysis, VTune itself crashes.
Are these problems caused by VTune or by my program? And if they are in my program, how can I track them down?

Jérôme Richard
arc_lupus

1 Answer


__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If 40% of your time is spent in this function, the OpenMP threads in your program have a load-balancing issue: some threads wait at the barrier while others are still working. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtimes to track what the threads are doing and when.

VTune should have at least minimal support for profiling OpenMP programs. A VTune crash is likely a bug and should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, see the Intel VTune OpenMP tutorial.
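As a minimal sketch of what fixing such an imbalance can look like, assume it comes from a parallel loop whose iterations have very uneven cost. The functions `process_item` and `process_all` are hypothetical and do not come from the question's code. With a static split, the thread that receives the expensive iterations finishes last and the others wait at the barrier; `schedule(dynamic, 16)` hands out small chunks on demand instead. The chunk size of 16 is an arbitrary assumption to tune.

```cpp
#include <cstdint>

// Hypothetical workload: the cost of item i grows with i, so an even
// static split would give the last thread far more work than the first.
static std::uint64_t process_item(int i) {
    std::uint64_t acc = 0;
    for (int k = 0; k <= i; ++k) {
        acc += static_cast<std::uint64_t>(k % 7);
    }
    return acc;
}

// Dynamic scheduling lets idle threads pick up the next chunk of 16
// iterations instead of waiting at the barrier at the end of the loop.
// The integer reduction keeps the result independent of thread order.
std::uint64_t process_all(int n) {
    std::uint64_t total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += process_item(i);
    }
    return total;
}
```

Compile with OpenMP enabled (for example `-fopenmp` with GCC or Clang). Without it the pragma is ignored and the loop runs serially with the same result. Whether dynamic scheduling actually helps depends on where the imbalance comes from, so it is worth re-running the hotspot analysis afterwards.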

Note that the VTune results should also mean that your OpenMP runtime is configured so that waiting threads actively poll the state of the other threads. This is good for reducing latency, but not always for performance or energy savings. You can control this behaviour of the runtime with the environment variable OMP_WAIT_POLICY, which accepts ACTIVE or PASSIVE.

Jérôme Richard
  • That is good to know. Still, why does VTune hang at __kmp_acquire_ticket_lock and not continue executing the program? Without that issue I would already have run a load-balancing check. – arc_lupus Apr 21 '21 at 14:31
  • A ticket lock is one implementation of a spin lock. This depends heavily on the implementation of the Intel runtime, but I think `__kmp_acquire_ticket_lock` tries to acquire the spin lock and does not succeed because another thread never releases a lock (I do not know exactly why; this may depend on the application code). VTune reporting this seems normal to me as long as there actually is a lock issue. If you do not get this problem without VTune, it might mean that there is a non-deterministic deadlock either in your code or in the OpenMP runtime. – Jérôme Richard Apr 21 '21 at 15:01
  • Is there a way to find that deadlock in my code, especially if I did not explicitly use OpenMP in my code? – arc_lupus Apr 21 '21 at 17:32
  • If you use a library that uses OpenMP, you can try to analyse the call stack when `__kmp_acquire_ticket_lock` causes the issue. This will tell you which library is responsible, so you can report the problem to its developers. Note that you probably need to (partially) disable optimizations so that the stack frames are still valid, and that this may change the behaviour of your program. Alternatively, you can try the Intel Inspector tool, which is supposed to find such issues (I have never used it). – Jérôme Richard Apr 21 '21 at 17:58
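
To illustrate the ticket-lock behaviour discussed in the comments above, here is a minimal sketch of a generic ticket lock written with `std::atomic`. It is not the implementation used by the Intel/LLVM OpenMP runtime; the class `TicketLock` and its helper `pending` are hypothetical names chosen for illustration. It shows why a thread can appear stuck in an acquire function: it spins until its ticket is served, and if the current holder never calls `unlock`, that never happens.

```cpp
#include <atomic>
#include <thread>

// Illustrative ticket lock (an assumption, not the OpenMP runtime's code).
class TicketLock {
public:
    void lock() {
        // Take the next ticket, then wait until that ticket is served.
        const unsigned my_ticket =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        while (now_serving_.load(std::memory_order_acquire) != my_ticket) {
            // If the holder never unlocks, this loop never terminates.
            std::this_thread::yield();
        }
    }

    void unlock() {
        // Only the current holder writes now_serving_, so a plain
        // load-then-store is sufficient here.
        now_serving_.store(
            now_serving_.load(std::memory_order_relaxed) + 1,
            std::memory_order_release);
    }

    // Hypothetical helper: threads holding or waiting for the lock.
    unsigned pending() const {
        return next_ticket_.load(std::memory_order_relaxed) -
               now_serving_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
};
```

In a debugger, a thread blocked this way shows up inside the acquire function with a call stack leading back to the code that requested the lock, which is why inspecting the stack is a reasonable first step.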