BR/RET timing discrepancy when returning from contrived subroutine to a modified return address

Question

While experimenting with the 64-bit ARM architecture, I noticed a peculiar speed difference depending on whether br or ret is used to return from a subroutine.

; Contrived for learning/experimenting purposes only, without any practical use
foo:
    cmp     w0, #0
    b.eq    .L0
    sub     w0, w0, #1
    sub     x30, x30, #4
    ret
.L0:
    ret ; Intentionally duplicated 'ret'

The intent of this subroutine is to make the caller of foo "reenter" foo w0 times by making foo return to the instruction that called foo in the first place (i.e. the instruction immediately before the one to which x30 points). In rough timing with a sufficiently large w0, this took about 1362 milliseconds on average. Curiously, replacing the first ret with br x30 makes it run over twice as fast, taking only about 550 milliseconds on average.
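
For reference, the faster variant described above differs only in the first return. This is a sketch using GNU assembler // comments (an assumption about the toolchain), with comments added for illustration:

```asm
    .text
    .globl  foo
// Variant of foo with the first 'ret' replaced by 'br x30'
foo:
    cmp     w0, #0
    b.eq    .L0
    sub     w0, w0, #1
    sub     x30, x30, #4        // point x30 back at the 'bl foo' in the caller
    br      x30                 // plain indirect branch instead of 'ret'
.L0:
    ret                         // x30 untouched: ordinary return
```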

The timing discrepancy goes away if the test is simplified to just repeatedly calling a subroutine with a bare ret/br x30. What makes the above contrived subroutine slower with a ret?
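
A minimal sketch of what such a simplified test could look like. The names bar and run_bar and the loop structure are hypothetical, not the harness that produced the timings above; GNU assembler syntax is assumed.

```asm
    .text
    .globl  run_bar
// Callee that returns without touching x30
bar:
    ret                             // swap in 'br x30' to compare

// Hypothetical driver: w0 = number of calls to make
run_bar:
    stp     x29, x30, [sp, #-32]!   // save frame pointer and our own return address
    str     x19, [sp, #16]          // x19 is callee-saved; use it as the counter
    mov     w19, w0
    cbz     w19, .Ldone             // nothing to do for a count of zero
.Lloop:
    bl      bar                     // bar returns to the next instruction
    subs    w19, w19, #1
    b.ne    .Lloop
.Ldone:
    ldr     x19, [sp, #16]
    ldp     x29, x30, [sp], #32     // restore and release the stack frame
    ret
```

Here every return goes to the address that the matching bl wrote into x30, unlike in foo above.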

I tested this on some kind of ARMv8.2 (Cortex-A76 + Cortex-A55) processor. I'm not sure to what extent big.LITTLE would mess with the timings, but they seemed pretty consistent over multiple runs. This is by no means a real [micro]benchmark, but instead a "roughly how long does this take if run N times" thing.

Peter Cordes · Accepted Answer · 2022-01-01T02:51:08.223

Most modern microarchitectures have a special predictor for returns, which relies on calls and returns pairing up with each other, as they tend to in real programs. (And predicting returns any other way is hard for functions with many call-sites: it's an indirect branch.)

By playing with the return address manually, you're making those return-predictions wrong. So every ret causes a branch mispredict, except the one where you didn't play with x30.

But if you use an indirect branch other than the one recognized specifically as a ret idiom, e.g. br x30, the CPU uses its standard indirect-branch prediction method, which does well when the br goes to the same location repeatedly.
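
To make that concrete, here is a sketch of a call site for the question's foo, annotated with what the explanation above implies for each version. The driver name run_foo is hypothetical (the question doesn't show its caller), GNU assembler syntax is assumed, and the comments describe the expected predictor behaviour rather than anything measured.

```asm
    .text
    .globl  run_foo
// Hypothetical driver: w0 = number of times foo should be re-entered
run_foo:
    stp     x29, x30, [sp, #-16]!   // bl overwrites x30, so save our return address
    bl      foo                     // sets x30 to the address of the 'ldp' below;
                                    // a return predictor would record that address
    // 'ret' version: each return with the adjusted x30 is predicted to land
    //   on the 'ldp' but actually lands on the 'bl' above: a mispredict.
    // 'br x30' version: the branch keeps going to the 'bl' above, a single
    //   stable target that an ordinary indirect-branch predictor can learn.
    ldp     x29, x30, [sp], #16     // reached only by the final, unmodified 'ret'
    ret
```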

A quick google search found some info from ARM for Cortex-R4 about the return-predictor stack on that microarchitecture for 32-bit mode (a 4-entry circular buffer): https://developer.arm.com/documentation/ddi0363/e/prefetch-unit/return-stack

For x86, https://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ is a good article about the concept in general, as well as some details on how various x86 microarchitectures maintain their prediction accuracy in the face of things like mis-speculated execution of a call or ret instruction that has to get rolled back.

(x86 has an actual ret opcode; ARM64 is the same: the ret opcode is like br, but with a hint that this is a function-return. Some other RISCs like RISC-V don't have a separate opcode, and just assume that branch-to-register with the link register is a return.)
