Perf shows 100% of L1 cache hits occurring in two separate parts of the program

Question

I ran the following Linux perf commands:

perf record -e mem_load_retired.l1_hit:P -c 10000 -a -- ./Program_to_Test.exe
perf report > mem_load_retired.l1_hit.txt
perf annotate > mem_load_retired.l1_hit_ann.txt

The annotate file shows that 100% of the mem_load_retired.l1_hit instances occurred at line 231, and again at lines 257-258:

mem_load_retired.l1_hit 231  100.00 vcvttpd2qq %zmm1,%zmm0{%k7}

mem_load_retired.l1_hit 257   66.67 vmovapd %zmm2,(%r15,%r14,1)
mem_load_retired.l1_hit 258   33.33 add %r9,%r14

Perf further shows that 100% of the mem_load_retired.l1_miss instances occurred at line 257, and none at 231.

My question is: how can 100% of the L1 cache hits occur at two parts of the code separated by 26 lines?

UPDATE: Following the comment below by Peter Cordes, I removed all unneeded line labels, and the distribution of hits is now different:

mem_load_retired.l1_hit 230   16.67 vmulpd %zmm28,%zmm0,%zmm1
mem_load_retired.l1_hit 231   16.67 vcvttpd2qq %zmm1,%zmm0{%k7}
mem_load_retired.l1_hit 232   16.67 vcvtuqq2pd %zmm0,%zmm2{%k7}

mem_load_retired.l1_hit 257   33.33 vmovapd %zmm2,(%r15,%r14,1)
mem_load_retired.l1_hit 258   16.67 add %r9,%r14

There is a necessary loop label between these two sections, but apparently because it has a return label to jump back, perf does not count it as a label designating a new function. The numbers above add up to 100%, so nothing is duplicated as before. This result would also apply to C and C++ with labels.
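The effect of the extra labels can be illustrated with a small Python sketch. It assumes, as Peter Cordes describes in the comments, that perf annotate normalizes percentages within each symbol rather than across the whole program. The sample counts and the symbol names Section_A, Section_B, and Whole_Function are hypothetical, chosen only so that the results match the percentages shown above; the real runs may have had different counts.

```python
from collections import defaultdict


def annotate_percentages(samples):
    """Return per-line percentages, normalized within each symbol.

    `samples` maps (symbol, line) -> number of samples. This mimics the
    assumed per-symbol normalization; it is not perf's actual code.
    """
    totals = defaultdict(int)
    for (symbol, _line), count in samples.items():
        totals[symbol] += count
    return {
        key: round(100.0 * count / totals[key[0]], 2)
        for key, count in samples.items()
    }


# Hypothetical counts: extra labels split the code into two "functions",
# so each symbol's lines sum to 100% on their own.
with_labels = {
    ("Section_A", 231): 1,
    ("Section_B", 257): 2,
    ("Section_B", 258): 1,
}

# Hypothetical counts: labels removed, so all lines share one symbol
# and the percentages sum to 100% across the whole function.
without_labels = {
    ("Whole_Function", 230): 1,
    ("Whole_Function", 231): 1,
    ("Whole_Function", 232): 1,
    ("Whole_Function", 257): 2,
    ("Whole_Function", 258): 1,
}

split = annotate_percentages(with_labels)
merged = annotate_percentages(without_labels)
```

With the labels present, line 231 gets 100.0 within Section_A while lines 257 and 258 get 66.67 and 33.33 within Section_B, which reproduces the apparent "100% in two places". With one symbol, the same kind of counts give 16.67, 16.67, 16.67, 33.33, and 16.67.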

Doesn't report / annotate show percentages by function? So 100% of the events *for that function* are on that first instruction, and the 2nd two are in a different function I assume. — Peter Cordes, Aug 07 '20 at 22:17
They're in separate parts of the same function, so that's not it. — RTC222, Aug 07 '20 at 22:29
Are you sure there isn't a label / symbol between them, that perf treats as the start of a function? If it's hand-written in asm `foo:` looks like a function. — Peter Cordes, Aug 07 '20 at 22:32
Yes, each is preceded by a separate label and that may be the issue. The labels are not used for jumps (they just demarcate sections now). I'll try removing the labels and post back. — RTC222, Aug 07 '20 at 22:37
Yes, clearly that would be the issue. `perf` doesn't care if labels are jumped to or not, or if the code before falls through into this label, it just assumes that every label visible in the symbol table is a function, because that's the case for C compiler output. — Peter Cordes, Aug 07 '20 at 22:52
I updated the question above to show the results. You were correct about the labels causing the difference. — RTC222, Aug 07 '20 at 23:08
*because it has a return label to jump back* - that doesn't sound likely. Did you maybe use a local label for the top of the loop, one that doesn't make it to the asm output? Like `.Lloop` instead of `looptop:` in GAS? IDK what you mean by a "return label", or how `perf report` would figure out that a label was a "return" label. (A [mcve] would help). I doubt it's going to treat it differently because it's the target of a backwards conditional branch, although that is maybe *possible*. Perf report's GUI does connect jumps with branch targets. — Peter Cordes, Aug 07 '20 at 23:32
Not a local label -- the section begins with Exponent_Label_0: It executes four lines and ends with jl Exponent_Label_0. It did not affect the count I showed above, as did the labels without a return, so I assumed that's because it has a return jump. — RTC222, Aug 07 '20 at 23:43
When you say "a return" you apparently mean a backwards `jcc`. "return" would normally mean `ret`, so that's confusing terminology. If you mean a backwards conditional branch, say "backwards jcc", not "return jump". Anyway, interesting, maybe `perf record` does ignore some labels based on them being targets of certain kinds of branches, unlike GDB. — Peter Cordes, Aug 08 '20 at 00:12
Yes, jump would be more accurate terminology. It's a conditional jump label to return to the top of the loop (jl Exponent_Label_0). — RTC222, Aug 08 '20 at 00:15
