7

I've been reverse engineering the EnterCriticalSection function on Windows 10 and found this interesting spin-loop:

enter image description here

It goes:

lbl_loop:

mov     ecx, [rsp+60h]
mov     ecx, [rsp+60h]
mov     ecx, [rsp+60h]
pause
mov     ecx, [rsp+60h]
inc     ecx
mov     [rsp+60h], ecx

cmp     ecx, eax
jb      lbl_loop

So my question is - what is the purpose of reading 4 times from [rsp+60h] and then writing back into it from a loop?

Why couldn't they just do:

lbl_loop:

pause
inc     ecx

cmp     ecx, eax
jb      lbl_loop

mov     [rsp+60h], ecx

PS. Note this is a production build of Windows 10. And the rest of the EnterCriticalSection function is optimized. So this is not a debugging build.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
c00000fd
  • 20,994
  • 29
  • 177
  • 400
  • Guessing here but maybe those 'redundant' instructions are used to influence instruction alignment for the pause and so it's an optimization technique? – vengy Aug 09 '23 at 19:45
  • @vengy They usually use a variation of a `nop` instruction for that. Like `nop [rax + rax * 8]` and the like. There's a whole bunch of those for different lengths. Here though, it is hitting the RAM, or cache in this case. – c00000fd Aug 09 '23 at 19:49
  • 3
    Looks a bit like a variable that was declared `volatile`, and what the optimizer came up with after stripping all superfluous code that's neither a load nor a store. – IInspectable Aug 10 '23 at 11:40
  • @IInspectable hmm, interesting. Although if that is the case it's a kinda bad "optimizer". – c00000fd Aug 10 '23 at 11:57
  • 1
    There isn't much an optimizer can do when the compiler is instructed [not to optimize loads and stores](https://godbolt.org/z/3aMj88oPG). I'm still not entirely convinced that this kind of code would slip through without an engineer having a look at the compiler's output every now and then. I would rather expect that this code is in fact the result of very deliberate decisions. – IInspectable Aug 10 '23 at 13:36
  • 2
    Maybe the whole purpose of this code was not to be optimized in the first place? It looks like a loop to waste cycles, it's looping `eax` times without doing anything. Maybe to reduce contention? The increment part can be understood as acting on a volatile var as IInspectable pointed out, maybe to avoid having the compiler remove the loop entirely. The three loads may either be intentional to further slow each iteration down or they may be what remains after optimizing away all other instructions without side effects. [Something like this ad hoc example](https://godbolt.org/z/xPdMsercn). – Margaret Bloom Aug 11 '23 at 09:29
  • 1
    @MargaretBloom it definitely looks like un-optimized assembly code. The question is why? In that loop the `pause` instruction itself is enough to "waste" enough time. There's really no need to do 4 loads and 1 store there. Unless I don't understand something. My only guess is that the author was trying to hit the CPU cache multiple times. But again, the question is what for? IMHO it only wastes CPU cycles and thus consumes more power. – c00000fd Aug 11 '23 at 11:20
  • 2
    My guess is that either they were already using a `volatile` loop variable and some expression of it before adding the `pause` instruction to the loop; or they have some debug code that they thought would be optimized away completely but in fact is not (maybe not realizing that with `pause` they don't need a `volatile` anymore). Studying the evolution of `RtlEnterCriticalSection` could help see if the loop was written from scratch or was a modification of an existing one. I have a Windows 7 VM but that loop doesn't seem to be there (not in that form, at least). Those loads make no sense to me. – Margaret Bloom Aug 11 '23 at 16:51
  • This is a delay loop. Making it run faster would defeat the purpose. @vengy – prl Aug 11 '23 at 16:53
  • @MargaretBloom: Yes, looks like a loop to run `pause` multiple times before the next retry, as part of the backoff strategy for contention. `[rsp+60h]` is a local variable, presumably *not* shared. IDK why it would be `volatile`, but maybe there was a reason that made sense to the person who wrote the source at the time. Maybe involving compiler memory barriers and maybe even `/volatile:ms` acquire semantics on loads? – Peter Cordes Aug 11 '23 at 17:50
  • 1
    I agree with @MargaretBloom's overall point: the multiple loads in the asm were probably not a goal, rather just a side-effect of how they wrote the C++ source. If you want it to run slower, `pause` more. (Although it pauses the front-end either about 5 cycles on AMD and older Intel, or about 100 cycles on Intel since Skylake). The loop-carried dependency on store/reload of the counter is also enough of a bottleneck that the extra loads aren't making this thread slower, just stealing back-end cycles from the other hyperthread. – Peter Cordes Aug 11 '23 at 17:52
  • @MargaretBloom oh, that loop was actually used in `ntdll!RtlpEnterCriticalSectionContended` that is invoked if the critical section is locked. It is quite a large function, but if you search by the byte sequence `8B 4C 24 60 8B 4C 24 60 8B 4C 24 60` you should find it. Unless it's that OS/build specific. Maybe it was a bug and they fixed it. (I'm using Win10 in my VM.) – c00000fd Aug 11 '23 at 18:57
  • 1
    @PeterCordes could be. It also looks like that loop is a part of a bail-out, or a backoff, (I think they call it that) mechanism from a lock. The counter in `eax` in that loop comes as a result of a CPU time stamp retrieved from the `RDTSC` instruction. Which hints at them trying to randomize that delay to prevent future contention. – c00000fd Aug 11 '23 at 19:03
  • @c00000fd: Yes, randomized backoff with increasing delays between tries is a standard strategy if you're going to retry a few times before yielding the CPU to another thread. It takes more code than just a fixed one or two `pause` delay between retries, but causes less contention with multiple threads waiting to enter. – Peter Cordes Aug 11 '23 at 20:18
  • 1
    @PeterCordes In this case, the extra reads are intentional and explicitly coded. (Though there is no explanation why. Maybe it is useful for non-x86 processors?) – Raymond Chen Aug 31 '23 at 16:28
  • @RaymondChen wouldn't that be excluded from the build by a preprocessor #if statement or by a template? Btw, to eliminate this debate, you have access to the current source code, can you check what's in there? it's in the `ntdll!RtlpEnterCriticalSectionContended` function. – c00000fd Aug 31 '23 at 16:33
  • @c00000fd Yeah, there is no architectural `#ifdef` around it. It just happens for everyone. No comment about why it's there. It's just... there. Also no comment when it was added (as part of a larger change). – Raymond Chen Aug 31 '23 at 16:39
  • @RaymondChen ok, thanks for checking and for chiming in. – c00000fd Aug 31 '23 at 18:56

0 Answers0