
I'm trying to optimise a part of my code that is called within a parallel region (OpenMP). I did a memory access analysis with Intel VTune Amplifier 2015 and am a bit confused about the result. I repeated the analysis with optimization levels O1, O2 and O3 with Intel Composer 2015, but the outcome is the same: Amplifier claims that most LLC misses occur in the following three lines:

__attribute__ ((aligned(64)))    double       x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
__attribute__ ((aligned(64)))    double       y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
__attribute__ ((aligned(64)))    double       z[4] = {0.e0, 0.e0,-1.e0, 1.e0};

The data is aligned because it is accessed later in vectorized code. I can't publish the whole code here because it is copyrighted. These three lines account for about 75% of the total cache misses within this function, although there are lots of calculations and other arrays later in the code. With O0 I get much more realistic results, because the misses are attributed to lines like

res[a] += tempres[start + b] * fact;

But at O0 the whole execution takes much more time (which is to be expected). So which results can I trust? Or which alternative software can I use to verify them?
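
I can't post the real code, but a stripped-down sketch of the structure looks roughly like this (function and parameter names and the driver are placeholders; the real calculations are much longer):

/* placeholder names; the real function is much larger and fully vectorised */
static void compute(double *res, const double *tempres, int start, double fact)
{
    __attribute__ ((aligned(64)))    double       x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
    __attribute__ ((aligned(64)))    double       y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
    __attribute__ ((aligned(64)))    double       z[4] = {0.e0, 0.e0,-1.e0, 1.e0};

    for (int a = 0; a < 4; ++a) {
        for (int b = 0; b < 4; ++b)
            res[a] += tempres[start + b] * fact;
        res[a] += x[a] + y[a] + z[a];   /* stands in for the real use of x, y and z */
    }
}

/* placeholder driver: res holds 4*n doubles, tempres at least n+3;
   compiled with OpenMP enabled, e.g. gcc -O2 -fopenmp */
void driver(double *res, const double *tempres, int n, double fact)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        compute(&res[4 * i], tempres, i, fact);
}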

Thanks in advance!

user3572032
  • [cachegrind](http://valgrind.org/docs/manual/cg-manual.html) is another tool if you want to verify the result. – Matt Dec 18 '14 at 09:27
  • How many cache misses do you get per function execution? Does the tool tell you whether these are i- or d-cache misses? – MikeMB Dec 18 '14 at 10:36
  • Is this function called in a tight loop? – MikeMB Dec 18 '14 at 10:40
  • I have to count the number of calls. It is a relatively big loop in that case (>100 SLOCs). – user3572032 Dec 18 '14 at 12:56
  • Well, then it is not necessarily a surprise that the cache lines get evicted between two consecutive runs of the function. – MikeMB Dec 18 '14 at 15:46
  • Hardware events can be attributed with an offset of at least ±1 instruction, and sampling hardware events introduces even more inaccuracy. Even if you increase the accuracy a lot, it is still hard to get an exact answer, because on an out-of-order architecture the moment when you "sit and wait" for an execution port is not the same moment at which you "issued" the instruction. There is also the problem of inaccurate debug info for an -O2/-O3 compiled binary, which can cause a lot of confusion when mapping instructions vs. source lines vs. hardware events. – zam Dec 18 '14 at 21:06
  • So if you really want to find the root cause, explore the assembly in VTune to make sure that the correlation between source lines and instructions (debug info) is what you expected. Also pay attention to the absolute cache-miss numbers (as suggested) to understand the statistical impact of the sampling. – zam Dec 18 '14 at 21:09

1 Answer


Looking only at percentages can be misleading (75% of 100 is less than 10% of 1000) - you'll need to look at the absolute number of misses when you compare.

Cache behaviour is also difficult to intuit, particularly in combination with compiler optimisations and CPU pipelines.
It looks like the optimised builds mostly miss the cache on initialisation (which is not too surprising) but manage to keep almost the entire computation in-cache, so I don't see a problem here.
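
One quick way to test that hypothesis (just a sketch, and it assumes x, y and z are never written to after their initialisation): declare them static const, so the data lives in read-only storage and is set up once instead of being stored on every call. If the reported LLC misses move away from those lines after re-profiling, the attribution was genuine rather than a debug-info artefact.

/* hypothetical change: only valid if x, y and z are read-only after initialisation */
static const double x[4] __attribute__ ((aligned(64))) = {1.e0,-1.e0, 0.e0, 0.e0};
static const double y[4] __attribute__ ((aligned(64))) = {0.e0,-1.e0, 1.e0, 0.e0};
static const double z[4] __attribute__ ((aligned(64))) = {0.e0, 0.e0,-1.e0, 1.e0};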

If you want to be sure, you'll need to study the generated assembly and the reference manuals for your hardware.

Searching for a tool that confirms your expectation is largely a waste of time, as you can't be sure that that tool isn't the one that's in error.

molbdnilo
  • Thanks so far, molbdnilo. This portion of code also has the most cache misses in absolute terms. I have to count how often the function is called and compare again. – user3572032 Dec 18 '14 at 12:55
  • @user3572032 Of course it has - you should compare the absolute values between optimised and unoptimised builds, using the same input. – molbdnilo Dec 18 '14 at 12:59