I'm trying to optimise a piece of code that is called within an OpenMP parallel region. I did a memory access analysis with Intel VTune Amplifier 2015 and am a bit confused about the result. I repeated the analysis at optimization levels O1, O2, and O3 with Intel Composer 2015, but the outcome is the same: Amplifier claims that most LLC misses occur in the following three lines:
__attribute__ ((aligned(64))) double x[4] = {1.e0,-1.e0, 0.e0, 0.e0};
__attribute__ ((aligned(64))) double y[4] = {0.e0,-1.e0, 1.e0, 0.e0};
__attribute__ ((aligned(64))) double z[4] = {0.e0, 0.e0,-1.e0, 1.e0};
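To give an idea of the access pattern without posting the real code, here is a minimal, made-up sketch (the function, loop, and intrinsics are my own illustration, not the actual implementation):

#include <immintrin.h>

/* Illustration only: the kind of aligned, vectorized access the arrays above feed into. */
void toy_kernel(const double *in, double *out, int n)
{
    __attribute__ ((aligned(64))) double x[4] = {1.e0,-1.e0, 0.e0, 0.e0};

    __m256d vx = _mm256_load_pd(x);              /* aligned load; needs at least 32-byte alignment */
    for (int i = 0; i + 4 <= n; i += 4) {
        __m256d vin = _mm256_loadu_pd(in + i);   /* input alignment is not guaranteed here */
        _mm256_storeu_pd(out + i, _mm256_mul_pd(vin, vx));
    }
}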
The data is aligned because it is accessed later in vectorized code of that kind. I can't publish the whole code here because it is copyrighted. These three lines account for about 75% of the total cache misses within this function, although there are lots of calculations and other arrays later in the code. With O0 optimization I get much more realistic results, because the hotspots there were lines like
res[a] += tempres[start + b] * fact;
But at O0 the whole execution takes much more time, which is expected. So which results can I trust? Or what alternative software could I use to cross-check them?
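Even a coarse, whole-program count of last-level-cache events would already help me sanity-check the attribution. As an example of what I mean (assuming a Linux host where the generic perf cache events are exposed; the binary name is only a placeholder):

perf stat -e LLC-loads,LLC-load-misses ./my_app

Comparing those totals between the O0 and O3 builds would at least show whether only the per-line attribution differs or the overall counts do as well.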
Thanks in advance!