
I'm using Intel Advisor to analyze my parallel application. The code below is the main loop of my program, where most of the time is spent:

   for(size_t i=0; i<wrapperIndexes.size(); i++){
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            ANNOTATE_ITERATION_TASK(localizeKeypoint);
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }

As you can see, localizeKeypoint is where the loop spends most of its time (if you don't consider the if clause). I want to produce a Suitability Report to estimate the gain from parallelizing the loop above, so I've written this:

   ANNOTATE_SITE_BEGIN(solve);
   for(size_t i=0; i<wrapperIndexes.size(); i++){
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            ANNOTATE_ITERATION_TASK(localizeKeypoint);
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }
   ANNOTATE_SITE_END();

The Suitability Report gave an excellent 6.69x gain, as you can see here:

[screenshot: Suitability Report]

However, when I launch the Dependencies check, I get this problem message:

[screenshot: Dependencies check problem message]

Note in particular "Missing start task".

In addition, if I place ANNOTATE_ITERATION_TASK at the beginning of the loop, like this:

   ANNOTATE_SITE_BEGIN(solve);
   for(size_t i=0; i<wrapperIndexes.size(); i++){
        ANNOTATE_ITERATION_TASK(localizeKeypoint);
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }
   ANNOTATE_SITE_END();

The gain is horrible:

[screenshot: Suitability Report with the annotation at the beginning of the loop]

Am I doing something wrong?

INTEL_OPT=-O3 -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl 
zam
justHelloWorld

1 Answer


You have to use the second approach, where ANNOTATE_ITERATION_TASK is the first statement of the annotated loop body. Otherwise you get (a) a wrong performance projection in Suitability and (b) "Missing start task" in Correctness.

If you run Correctness on the second variant (where the iteration task is at the very beginning of the loop body), then Correctness should be OK.
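The placement described above can be sketched on a simplified loop. This is only an illustration: `expensiveWork` and `processAll` are hypothetical stand-ins for `localizeKeypoint` and the question's loop, and the no-op fallback macros are an assumption added so the sketch compiles when `advisor-annotate.h` is not on the include path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Use the real Advisor macros when the header is available; otherwise
// fall back to no-op stubs (an assumption made only for this sketch).
#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#  endif
#endif
#ifndef ANNOTATE_SITE_BEGIN
#  define ANNOTATE_SITE_BEGIN(name)
#  define ANNOTATE_SITE_END()
#  define ANNOTATE_ITERATION_TASK(name)
#endif

// Hypothetical stand-in for localizeKeypoint: some costly per-item work.
static long expensiveWork(int v) {
    long acc = 0;
    for (int k = 0; k < 1000; ++k) acc += (v * k) % 7;
    return acc;
}

// Processes every item above `threshold`; returns how many were processed.
std::size_t processAll(const std::vector<int>& items, int threshold,
                       std::vector<long>& out) {
    assert(out.size() == items.size());
    std::size_t hits = 0;
    ANNOTATE_SITE_BEGIN(solve);
    for (std::size_t i = 0; i < items.size(); ++i) {
        // The task annotation is the first statement of the loop body,
        // so the whole iteration (filter plus work) belongs to the task.
        ANNOTATE_ITERATION_TASK(localizeKeypoint);
        if (items[i] > threshold) {  // braces keep the call inside the if
            out[i] = expensiveWork(items[i]);
            ++hits;
        }
    }
    ANNOTATE_SITE_END();
    return hits;
}
```

The explicit braces around the `if` body make it unambiguous which statements are conditional, which is easy to get wrong when an annotation macro sits directly under an unbraced `if`.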

Your second Suitability chart is not horrible. It just says that you have to take care of task chunking (click the "chunking" link in the tool to learn more about it). Fortunately, in recent OpenMP implementations chunking is "good enough" by default; see https://software.intel.com/en-us/articles/openmp-loop-scheduling . So to see the Advisor projection with chunking ON, you just need to tick the corresponding check-box, and the result will not be that bad.
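As a rough illustration of what chunking means on the OpenMP side, here is a minimal sketch of a filtered loop parallelized with an explicit chunk size. The names `expensiveWork` and `processChunked` and the chunk size of 64 are assumptions chosen for the example, not values taken from the question. Without a `schedule` clause the scheduling is implementation-defined, and common runtimes already hand each thread a block of iterations rather than single ones.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for localizeKeypoint: some costly per-item work.
static long expensiveWork(int v) {
    long acc = 0;
    for (int k = 0; k < 1000; ++k) acc += (v * k) % 7;
    return acc;
}

// Build with OpenMP enabled (e.g. -qopenmp or -fopenmp). Without it the
// pragma is ignored and the loop simply runs serially.
std::vector<long> processChunked(const std::vector<int>& items, int threshold) {
    std::vector<long> out(items.size(), 0);
    const long n = static_cast<long>(items.size());

    // schedule(dynamic, 64): threads grab 64 iterations at a time, so the
    // per-task scheduling overhead is paid once per chunk instead of once
    // per iteration. The chunk size here is illustrative only.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < n; ++i) {
        if (items[i] > threshold) {
            out[i] = expensiveWork(items[i]);  // each iteration writes its own slot
        }
    }
    return out;
}
```

Dynamic scheduling is used here because only some iterations pass the filter, so the work per iteration is uneven; a plain `#pragma omp parallel for` is the simpler starting point.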

zam
  • So is "Task Chunking" just `#pragma omp parallel for` (in the case of OpenMP, of course)? – justHelloWorld Apr 19 '17 at 07:48
  • Another question: why is there "no information available" in so many columns of the second image? I added the compiler flags to the question – justHelloWorld Apr 19 '17 at 07:50
  • I did as you said but I keep seeing "Missing start task" – justHelloWorld Apr 19 '17 at 13:05
  • You are right about Task Chunking. As for Correctness (missing start task), it looks like an obvious bug, possibly caused by a specific optimization the compiler has performed. You may submit the bug via Intel support (if you have it), or otherwise you can post a question with the application attached on the IDZ forum. You can also post it here if you strongly prefer, but I'm just a Stack Overflow member, not an Intel support person, so it's hard to promise I can directly help. – zam Apr 19 '17 at 16:38
  • Regarding "No Information Available": these other columns are populated once you run Memory Access Pattern analysis step. You can do that with your most recent annotated version just fine – zam Apr 19 '17 at 16:42
  • I've already posted on IDZ, no answer yet. I'll try to send a bug report – justHelloWorld Apr 19 '17 at 16:43
  • And the "Memory Access Pattern" analysis type is available if you switch to "Vectorization workflow" (it's the last step, "2.2", in the Workflow pane) – zam Apr 19 '17 at 16:44
  • IDZ is enough I guess. It would be better if you attached your application (binary/exe compiled with annotations) to your forum post – zam Apr 19 '17 at 16:47
  • Could you please take a look at [this](http://stackoverflow.com/questions/43844396/how-should-i-interpreter-these-vtune-results) question? – justHelloWorld May 08 '17 at 09:44