I'm trying to apply some performance-engineering techniques to an implementation of Dijkstra's algorithm. To find bottlenecks in the (naive, unoptimised) program, I'm using the perf command to record the number of cache misses. The relevant snippet is the following, which finds the unvisited node with the smallest distance:
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}
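For reference, I'm collecting the counts roughly like this (the binary and input file names here are placeholders, and the exact event list is just what I'd expect to use, so treat this as a sketch rather than my exact invocation):

```shell
# Sample last-level-cache load misses while running the program
# ("dijkstra" and "graph.txt" are placeholder names).
perf record -e LLC-load-misses ./dijkstra graph.txt

# Show the per-instruction attribution of the recorded samples.
perf report
```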
For the LLC-load-misses metric, perf report shows the following annotated assembly:
      │     for (int i = 0; i < count; i++) {
 1.19 │ ff:   add     $0x1,%eax
 0.03 │102:   cmp     0x20(%rsp),%eax
      │     ↓ jge     135
      │     if (!visited[i]) {
 0.07 │       movslq  %eax,%rdx
      │       mov     0x18(%rsp),%rdi
 0.70 │       cmpb    $0x0,(%rdi,%rdx,1)
 0.53 │     ↑ jne     ff
      │     if (tmp == -1 || dist[i] < dist[tmp]) {
 0.07 │       cmp     $0xffffffff,%r13d
      │     ↑ je      fc
 0.96 │       mov     0x40(%rsp),%rcx
 0.08 │       movslq  %r13d,%rsi
      │       movsd   (%rcx,%rsi,8),%xmm0
 0.13 │       ucomisd (%rcx,%rdx,8),%xmm0
57.99 │     ↑ jbe     ff
      │     tmp = i;
      │       mov     %eax,%r13d
      │     ↑ jmp     ff
      │     }
      │     }
      │     }
My question, then, is the following: why does the jbe instruction produce so many cache misses? As far as I can tell, this instruction should not load anything from memory at all. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses with L1-dcache-load-misses shows that a large share of the misses are attributed to that instruction.
This result stumps me. Could anyone explain this (to my eyes) odd result? Thank you in advance.