1

I am trying to parallelize one hotspot of my C++ program with OpenMP, but it does not scale. It needs 25 seconds on 1 thread, and I only achieve 21 seconds with 2 threads. I did a Locks & Waits analysis with Intel VTune Amplifier, but it does not really help me. It looks like:

[Screenshot: result of the VTune Amplifier analysis]

I especially do not understand where mkl_blas_dcopy comes from and what is calling it (even if I remove my parallel region, I still see this call and a second thread in the timeline).

I tried to get more information out of the Top-Down Tree, but it is not really helpful for me.

[Screenshot: Top-Down Tree]

An Advanced Hotspots analysis did not give me more information either. How should I approach this issue in order to identify the problem?

Additional information: Previously the overall runtime was much worse. I did a lot of optimisations in the serial code and improved the performance, but since then my code no longer scales.

Many thanks in advance!

Edit: Here is also the timeline, where no transitions are shown, no matter how far I zoom in. In this case I used another test case with 8 threads. [Screenshot: timeline]

user3572032
  • Why didn't you show the Locks & Waits diagram, which indicates how the threads synchronize? Meanwhile, from the wait and spin time numbers I can conclude that the worker threads spent a lot of time waiting for work. That is quite legitimate if you really have no parallel work for them. No idea why MKL is mentioned; did you link MKL to your app? – Anton Nov 20 '14 at 11:51
  • I added a timeline. It is from a different test case, because the one from above can't be opened anymore. But it shows the same problem: no transitions are shown. – user3572032 Nov 20 '14 at 13:04

2 Answers

3
  1. Which version of VTune do you use? It does not look like the latest one: the frame rate for OpenMP regions shown on your screenshot has been removed in the current version. It is worth trying the new 2015 Update 1, which includes some fixes and improvements for OpenMP analysis.
  2. Which compiler and OpenMP runtime do you use? With the Intel compiler and the Intel OpenMP runtime, VTune analysis is much more informative for OpenMP regions. Just change the grouping in Bottom-up from "Function/callstack" to "OpenMP region/..." and you will find a lot of interesting information.
  3. You see mkl_blas_dcopy because you seem to use MKL functions in your code. mkl_blas_dcopy is just an internal MKL function. You can find the actual MKL call in your code by looking at the stack panel on the right when the "mkl_blas_dcopy" hotspot is selected in Bottom-up: you should be able to see the call chain up to main().
  4. MKL is already parallelized with OpenMP. It is possible that you put an MKL call inside your own OpenMP region. If this is the case, it is not optimal: OpenMP does not handle nesting well. You should choose between the parallel version of MKL without your own OpenMP region, and the serial MKL library inside your OpenMP parallel region. You can control the serial/parallel MKL setting via linking, see the MKL Link Line Advisor: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
  5. Each frame in the timeline on your screenshot is likely an OpenMP region from MKL. There seem to be many parallel regions of short duration, which may indicate that MKL is called from a loop, so that each iteration starts, executes and stops an OpenMP parallel region. The start and stop actions have some overhead, which contributes to your large waiting time. So it may be worth trying the serial MKL version inside an outer OpenMP loop, to avoid re-entering a parallel region many times.
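
To illustrate points 4 and 5, here is a minimal sketch of the two call patterns. It does not use MKL: `copy_threaded` and `copy_serial` are hypothetical stand-ins for a threaded and a serial build of a small library kernel, and `Rows`, `many_short_regions` and `one_outer_region` are names made up for this example. It assumes the code is built with OpenMP enabled (for example `-fopenmp` with GCC/Clang or `-qopenmp` with the Intel compiler); without that flag the pragmas are ignored and everything runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Hypothetical stand-in for a threaded library kernel: it opens (and closes)
// its own OpenMP parallel region on every call.
void copy_threaded(const std::vector<double>& src, std::vector<double>& dst) {
    assert(src.size() == dst.size());
    const long n = static_cast<long>(src.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) dst[i] = src[i];
}

// Hypothetical stand-in for the serial build of the same kernel.
void copy_serial(const std::vector<double>& src, std::vector<double>& dst) {
    assert(src.size() == dst.size());
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i];
}

// Pattern A: threaded kernel called from a serial loop.
// One parallel region is started and stopped per iteration, so the
// start/stop overhead is paid in.size() times.
void many_short_regions(const Rows& in, Rows& out) {
    assert(in.size() == out.size());
    for (std::size_t r = 0; r < in.size(); ++r) copy_threaded(in[r], out[r]);
}

// Pattern B: serial kernel inside one outer parallel loop.
// A single parallel region covers all iterations; each thread handles
// whole rows and the kernel itself does no threading.
void one_outer_region(const Rows& in, Rows& out) {
    assert(in.size() == out.size());
    const long rows = static_cast<long>(in.size());
    #pragma omp parallel for
    for (long r = 0; r < rows; ++r) copy_serial(in[r], out[r]);
}
```

Both functions produce the same result. When each row is small, pattern A tends to show up in a profile as many short regions with a lot of wait and spin time, while pattern B pays the region overhead only once. Whether pattern B is actually faster in your case depends on the work per call, so measure both.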
  • Thanks. I will investigate your points and give you the results when I am done. I am using an older version because no newer versions are available on our cluster. It is the Cluster Studio with icc 2014 and an older Amplifier. – user3572032 Nov 20 '14 at 15:15
  • I updated to Amplifier 15 and it gives me different results, but those make more sense to me. I found out that the NAG Library calls MKL in parallel and could fix this "issue". I have to check whether this influences the total runtime. Thank you again. – user3572032 Nov 26 '14 at 09:52
  • I came here to find an explanation of what the VTune measurements stand for. These measurements and their meaning should change only very rarely from one version of VTune to another. My code needs much more time than VTune reports, and I wanted to see where this time is being spent. VTune "Locks and Waits" is of no help. – Frank Puck Apr 25 '22 at 17:59
1

Transitions are shown for synchronization objects. In this case the waiting time likely comes from the OpenMP runtime inside the MKL library. In more recent versions of VTune you will see this time as overhead and spin time.

  • How can I see where MKL is called? I never call it explicitly. – user3572032 Nov 20 '14 at 15:12
  • 1
    Look at the stack panel on the right when the "mkl_blas_dcopy" hotspot is selected in Bottom-up: you should be able to see the call chain up to main(). If you don't call MKL yourself, maybe you call some other library that links to MKL. – Kirill Rogozhin Nov 21 '14 at 07:51