Consider this code:

.globl _non_tail, _tail
.text

.code32
_non_tail:
    lcall $0x33, $_non_tail.heavensgate
    ret

.code64
_non_tail.heavensgate:
    # do stuff. there's 12 bytes on the stack before the first argument
    lret

.code32
_tail:
    pushl (%esp)
    movw %cs, 4(%esp)
    ljmp $0x33, $_tail.heavensgate

.code64
_tail.heavensgate:
    # do stuff. there's 8 bytes on the stack before the first argument
    lret

Will _tail cause the return stack buffer to mispredict future returns? On the one hand, it's pairing a near call with a far return, but on the other hand, it's still returning to the exact same place that it would have normally.

Peter Cordes
  • 1
    You can also do `push cs` \ `call near function` to call a function, in your current `cs`, that returns far. I assume neither case will be easily predicted. – ecm Jun 20 '22 at 17:18
  • 4
    Will the CPU even speculate past a far call or ret? That's such an edge case (far calls are almost never used, let alone intra-privilege ones like the *heavensgate*, which comes from malware), and the Return Address Predictor would also need to remember at least the original bitness for the FE to speculate ahead of the execution of an `lret`. That seems cumbersome. From a quick test I don't see any difference between a `push cs / push ret_label / jmp 33h:func / retf` (not predicted normally) and `call 33h:func / retf` (predicted). – Margaret Bloom Jun 20 '22 at 17:25
  • ^ I tried to adapt the first two columns of [Table 1 of Henry Wong's post about RAP/RAS](https://blog.stuffedcow.net/2018/04/ras-microbenchmarks/). – Margaret Bloom Jun 20 '22 at 17:28
  • 1
    @MargaretBloom I'm not really interested in what happens with the far return itself. I'm interested in what happens with all of the subsequent near returns. – Joseph Sible-Reinstate Monica Jun 20 '22 at 18:31
  • 3
    I suspect this would break future rets in the caller of `_tail`, since there was never a near `ret` to balance the near `call`. I'm assuming that far ret and far call don't interact with the RAS (return address stack) predictor, as @MargaretBloom suggested. (As for real-world usage of far jumps/calls, WOW64 does a far call from 32-bit user-space into 64-bit user-space DLL code, at least to execute `syscall` instead of using 32-bit `sysenter`. Seems inefficient to me, but it's *possible* Intel would have put some engineering effort into predicting that commercially-relevant return.) – Peter Cordes Jun 21 '22 at 04:12
  • 1
    @JosephSible-ReinstateMonica Ah ok, sorry, I misunderstood you. Anyway, it seems that far calls/rets are ignored by the RAS/RAP. A far call takes 214 cycles on average on my CPU. A near call that contains only a far call or only a push+far_jump takes 218 cycles either way, which seems consistent with the timing from Henry Wong (~4 cycles per predicted return + 214 cycles per far call/jmp), and I think it shows that far rets are ignored (otherwise the near call would be mispredicted). – Margaret Bloom Jun 21 '22 at 09:14
  • @MargaretBloom: Since you have experimental results compatible with my guess, that's sufficient to post an answer. Feel free to quote or paraphrase my comment. It's not a surprising result, although if you have time to write up how you benchmarked, it might be fun to include that. – Peter Cordes Jun 21 '22 at 12:16
  • @PeterCordes I'm too rusty to write an answer :) please go ahead if you feel like it – Margaret Bloom Jun 22 '22 at 17:57
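
The concern raised in the comments (a near `call` that is never balanced by a near `ret`) can be illustrated with a toy model. This is only a sketch under the assumption discussed above, namely that far call, far jmp and far ret are invisible to the return address stack; it is not a description of any real CPU's predictor, which has finite depth and its own underflow behavior. The function name `count_ret_mispredicts` and labels such as `"outer+1"` are hypothetical.

```python
# Toy model of a return address stack (RAS).
# ASSUMPTION: only near call/ret touch the RAS; far call, far jmp and
# far ret are ignored by the predictor. Real hardware may differ.

def count_ret_mispredicts(trace):
    """trace is a list of (op, addr) tuples.
    ("call", return_addr) pushes a prediction,
    ("ret", actual_target) pops one and compares it,
    any other op (lcall, ljmp, lret) is skipped."""
    ras = []
    mispredicts = 0
    for op, addr in trace:
        if op == "call":
            ras.append(addr)
        elif op == "ret":
            predicted = ras.pop() if ras else None
            if predicted != addr:
                mispredicts += 1
    return mispredicts

# Hypothetical call chain: start -> main -> outer -> gate function.
non_tail = [
    ("call", "start+1"),  # start calls main
    ("call", "main+1"),   # main calls outer
    ("call", "outer+1"),  # outer calls _non_tail
    ("lcall", None),      # far call into 64-bit code
    ("lret", None),       # far return back into _non_tail
    ("ret", "outer+1"),   # near ret balances the near call
    ("ret", "main+1"),
    ("ret", "start+1"),
]

tail = [
    ("call", "start+1"),
    ("call", "main+1"),
    ("call", "outer+1"),  # outer calls _tail
    ("ljmp", None),       # far jump into 64-bit code
    ("lret", None),       # far return straight to outer, no near ret
    ("ret", "main+1"),    # pops the stale "outer+1" entry
    ("ret", "start+1"),   # pops "main+1", off by one again
]

print(count_ret_mispredicts(non_tail))  # 0
print(count_ret_mispredicts(tail))      # 2
```

In this model the stale entry left by `_tail` shifts every later prediction by one, so each enclosing near return misses until the stack drains. Whether real hardware behaves this way is exactly what the question asks and would need to be measured.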
