While experimenting with the 64-bit ARM architecture, I noticed a peculiar speed difference depending on whether `br` or `ret` is used to return from a subroutine.
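
    // Contrived for learning/experimenting purposes only, without any practical use
    foo:
        cmp   w0, #0
        b.eq  .L0            // w0 == 0: return normally
        sub   w0, w0, #1     // one fewer re-entry left
        sub   x30, x30, #4   // point x30 back at the instruction that called foo
        ret                  // the 'ret' that gets swapped for 'br x30'
    .L0:
        ret                  // Intentionally duplicated 'ret'
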
The intent of this subroutine is to make the caller "reenter" `foo` `w0` times by making `foo` return to the instruction that called it in the first place (i.e. the instruction immediately before the one to which `x30` points). With some rough timing, and with `w0` set to a sufficiently high value, it took about 1362 milliseconds on average. Curiously, replacing the first `ret` with `br x30` makes it run over twice as fast, taking only about 550 milliseconds on average.
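To make the intended control flow concrete, here is a minimal C sketch that models only the three pieces of state involved (`w0`, `x30`, and the program counter). It is not the real code and says nothing about timing. The names `struct model`, `BL_ADDR`, `model_foo`, and `count_foo_entries` are all hypothetical, and the address chosen for the calling instruction is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the only machine state that matters here. */
struct model {
    uint32_t w0;   /* argument: number of re-entries still to go */
    uint64_t x30;  /* link register */
    uint64_t pc;   /* address of the next instruction to execute */
};

/* Assumed address of the caller's call instruction (arbitrary, 4-byte aligned). */
enum { BL_ADDR = 0x1000 };

/* Model of foo: either rewind x30 by one instruction, or leave it alone. */
static void model_foo(struct model *m)
{
    assert(m->x30 >= 4);   /* x30 must point just past a 4-byte call */
    if (m->w0 != 0) {
        m->w0 -= 1;        /* sub w0, w0, #1 */
        m->x30 -= 4;       /* sub x30, x30, #4 */
    }
    m->pc = m->x30;        /* ret (or br x30): architecturally the same jump */
}

/* Execute the caller's call until control finally falls through to the
 * instruction after it; return how many times foo was entered. */
static uint64_t count_foo_entries(uint32_t n)
{
    struct model m = { .w0 = n, .x30 = 0, .pc = BL_ADDR };
    uint64_t entries = 0;

    while (m.pc == BL_ADDR) {
        m.x30 = BL_ADDR + 4;   /* the call sets x30 to the next instruction */
        entries++;
        model_foo(&m);
    }
    return entries;
}
```

Under this model, `count_foo_entries(n)` comes out as `n + 1`: the initial call plus `n` re-entries. That count is the same whether the final jump is spelled `ret` or `br x30`, so the model shows only that both variants do the same architectural work.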
The timing discrepancy goes away if the test is simplified to just repeatedly calling a subroutine with a bare `ret`/`br x30`. What makes the above contrived subroutine slower with a `ret`?
I tested this on some kind of ARMv8.2 (Cortex-A76 + Cortex-A55) processor. I'm not sure to what extent big.LITTLE would mess with the timings, but they seemed pretty consistent over multiple runs. This is by no means a real [micro]benchmark, but instead a "roughly how long does this take if run N times" thing.
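For reference, a "roughly how long does this take" measurement of this kind could look like the following C sketch. It is an assumption about the general shape of such a harness, not the code actually used for the numbers above. `workload`, `elapsed_ms`, and `average_ms` are hypothetical names, and `workload` is only a portable stand-in for the routine under test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the routine being timed. */
static volatile uint32_t sink;
static void workload(uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        sink = i;   /* volatile store keeps the loop from being optimized away */
}

/* Wall-clock time of a single call to fn(arg), in milliseconds. */
static double elapsed_ms(void (*fn)(uint32_t), uint32_t arg)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Average of several runs, to smooth out run-to-run noise. */
static double average_ms(void (*fn)(uint32_t), uint32_t arg, int runs)
{
    assert(runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; i++)
        total += elapsed_ms(fn, arg);
    return total / runs;
}
```

A call such as `average_ms(workload, 1000000, 10)` gives one averaged figure per variant. A harness like this does nothing to control which core the thread runs on, so it does not remove the big.LITTLE uncertainty.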