Working example of UMONITOR/UMWAIT-based assembly (asm) spin-wait loops as a replacement for PAUSE-based test-test-and-set loops

Question

In the Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel gives an example of a PAUSE-based spin-wait loop ("Example 11-4. Spin-wait Loop and PAUSE Instruction").

However, for the UMONITOR/UMWAIT instructions, the manual offers only brief mentions and no example ("Support for user level low-power and low-latency spin-loop instructions UMWAIT/UMONITOR and TPAUSE", and "UMWAIT/UMONITOR/TPAUSE instructions enable power savings in user level spin loops"). These instructions were introduced with the Tremont microarchitecture, which Intel released in September 2020 as part of the Elkhart Lake platform.

Can you please give a working assembly-language example of SpinWaitAndAquireLock and ReleaseLock functions that use a byte as the synchronization variable (in either 64-bit or 32-bit addressing mode), such that the byte can hold only one of two values: cLockByteLocked (e.g., 0FFh) or cLockByteAvailable (e.g., 0)? As a reference, here are such functions using the PAUSE instruction; they can be assembled with nasm -f win64:

section .text

; This code exposes two global functions: (1) SpinWaitAndAquireLock; (2) ReleaseLock.


cLockByteLocked     equ 0FFh
cLockByteAvailable  equ 0

; wait for the byte to not be locked and lock it
global SpinWaitAndAquireLock

; release lock obtained previously by SpinWaitAndAquireLock
global ReleaseLock

SpinWaitAndAquireLock:
; input: rcx - address of the synchronization variable lock byte
   mov  eax, cLockByteLocked ; load the value that denotes "locked" into al; use eax to clear remaining bits to avoid false dependency on existing eax value, bits 32-63 of rax are also cleared
   jmp  @FirstCompare
@NormalLoadLoop:
   pause  
@FirstCompare:
; use the "test, test-and-set" technique
   cmp  [rcx], al   ; try normal load first instead of the locked load - "test" is a first part of "test, test-and-set"
   je   @NormalLoadLoop ; for static branch prediction, jump backward means "likely"
   lock xchg [rcx], al  ; since normal load shows the synchronization variable is available, try locked load and store ("test-and-set")
   cmp  al, cLockByteLocked
   je   @NormalLoadLoop
   ret

ReleaseLock:
; input: rcx - address of the synchronization variable lock byte
   mov   byte [rcx], cLockByteAvailable ; just use normal store
   ret

The vol.2 entry for `mwait` has pseudo-code for an example: https://www.felixcloutier.com/x86/mwait#example . It looks appropriate for `umonitor` / `umwait`, too, basically replacing `pause` in the spin loop. You still check the memory location on wake because spurious wake is possible, but hopefully you sleep until another core has stored to the location you set up a monitor on. Some other links in [How to use the monitor / mwait instructions in x86-64 assembly on Mac or baremetal](https://stackoverflow.com/q/55296528) might be relevant. — Peter Cordes, Dec 30 '22 at 08:21
@PeterCordes - can you please still write a small piece of code that works? I saw the pseudocode, but I was not sure whether I understood it correctly. However, the code that you referred to at https://stackoverflow.com/questions/55296528/how-to-use-the-monitor-mwait-instructions-in-x86-64-assembly-on-mac-or-baremet is helpful; I will try to adapt it for umwait/umonitor. — Maxim Masiutin, Dec 30 '22 at 11:45
I don't have hardware to test on, so I'm reluctant to post an answer. If SDM supports it somehow (probably "waking up" / timing out immediately), that might be better than nothing. It seems fairly trivial to use, though, and `umonitor` and `umwait` are both clearly documented (that's the other part of what makes me not want to take the time to write up an example). Is there any specific part you aren't sure you've understood? — Peter Cordes, Dec 30 '22 at 11:48
I have the hardware, @PeterCordes (a remote Linux server with i5 12th Gen). I will write the code, share it here, and let you know. Thank you for your help! — Maxim Masiutin, Dec 30 '22 at 11:59
@PeterCordes - do you have an idea why CPUID returns two values: one for the smallest and one for the largest monitor-line size in bytes? In my case, both values are the same: 64 bytes. Can you speculate on why the sizes might differ and why both might be needed? — Maxim Masiutin, Jan 09 '23 at 22:03
I haven't checked the manuals, but it might be a similar idea to `std::hardware_constructive_interference_size` and `std::hardware_destructive_interference_size` (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size): the size limit where you definitely will get notified of a change, and the largest size where you *might* get woken up by a store nearby, so put other data at least that far away to avoid spurious wakeups. The fact that current HW works on single cache lines isn't surprising, but it's smart to leave room for different future designs. — Peter Cordes, Jan 10 '23 at 03:09
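The two CPUID values discussed in the comments above, together with the feature bit for UMONITOR/UMWAIT, can be queried along the following lines. This is a minimal, untested sketch in the same nasm -f win64 style as the question. It assumes that WAITPKG support is reported in CPUID.(EAX=07H,ECX=0):ECX bit 5 and that leaf 05H returns the smallest and largest monitor-line sizes in EAX[15:0] and EBX[15:0]. The function names HasWaitPkg and GetMonitorLineSizes are hypothetical, and returning the second value in edx is a convention for assembly callers only, not part of the Win64 ABI.

```nasm
section .text

global HasWaitPkg            ; hypothetical name
global GetMonitorLineSizes   ; hypothetical name

HasWaitPkg:
; output: eax = 1 if CPUID.(EAX=07H,ECX=0):ECX[5] (WAITPKG) is set, else 0
   mov   r9, rbx          ; cpuid clobbers rbx, which is callee-saved in Win64
   xor   r8d, r8d         ; result defaults to 0
   xor   eax, eax
   cpuid                  ; leaf 0: eax = highest supported basic leaf
   cmp   eax, 7
   jb    .Done
   mov   eax, 7
   xor   ecx, ecx         ; subleaf 0
   cpuid
   bt    ecx, 5           ; CF = WAITPKG feature bit
   setc  r8b
.Done:
   mov   eax, r8d
   mov   rbx, r9
   ret

GetMonitorLineSizes:
; output: eax = smallest monitor-line size in bytes (CPUID.05H:EAX[15:0])
;         edx = largest monitor-line size in bytes (CPUID.05H:EBX[15:0])
;         both are 0 if leaf 05H is not available
   mov   r9, rbx
   xor   eax, eax
   cpuid                  ; leaf 0: eax = highest supported basic leaf
   cmp   eax, 5
   jb    .Unsupported
   mov   eax, 5
   cpuid
   movzx eax, ax          ; keep bits 15:0 only
   movzx edx, bx
   jmp   .Done
.Unsupported:
   xor   eax, eax
   xor   edx, edx
.Done:
   mov   rbx, r9
   ret
```

If the two sizes ever differ, a conservative layout would keep unrelated, frequently written data at least the larger size away from the lock byte.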
@PeterCordes - I implemented the instructions in https://github.com/maximmasiutin/FastMM4-AVX and they give a significant benefit when the number of virtual threads used by the program exceeds the number of physical threads provided by the CPU; otherwise there is no benefit. In FastMM4-AVX, for Linux on a 12-thread CPU with 64 logical threads, the program with the umonitor/umwait implementation is 6(!) times faster than the "pause"-based one. The benchmark code is at Tests/Benchmarks/Realloc.dpr. — Maxim Masiutin, Mar 19 '23 at 20:59
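The approach described in the comments (arm a monitor, re-check the byte, then wait instead of executing PAUSE) might look roughly like the sketch below. It is untested and is not the FastMM4-AVX code. It assumes that WAITPKG support (CPUID.(EAX=07H,ECX=0):ECX bit 5) was verified beforehand, that the assembler is recent enough to know the umonitor and umwait mnemonics, and that the lock byte lives in ordinary write-back memory. The name SpinWaitAndAquireLockUmwait and the constant cUmwaitTscTicks (with its value) are hypothetical.

```nasm
section .text

cLockByteLocked     equ 0FFh
cLockByteAvailable  equ 0
cUmwaitTscTicks     equ 100000   ; hypothetical per-wait budget in TSC ticks; tune it

global SpinWaitAndAquireLockUmwait   ; hypothetical name

SpinWaitAndAquireLockUmwait:
; input: rcx - address of the synchronization variable lock byte
; clobbers rax, rdx, r8 (all volatile in the Win64 calling convention)
   cmp   byte [rcx], cLockByteLocked   ; "test": plain load first
   je    .Wait
.TestAndSet:
   mov   eax, cLockByteLocked
   xchg  [rcx], al                     ; "test-and-set"; xchg with memory is implicitly locked
   cmp   al, cLockByteLocked
   jne   .Done                         ; old value was "available": the lock is ours
.Wait:
   umonitor rcx                        ; arm the monitor on the line containing the lock byte
   cmp   byte [rcx], cLockByteLocked   ; re-check after arming so a release that
   jne   .TestAndSet                   ; happened just before umonitor is not missed
   rdtsc                               ; edx:eax = current TSC
   add   eax, cUmwaitTscTicks
   adc   edx, 0                        ; edx:eax = absolute TSC deadline for umwait
   xor   r8d, r8d                      ; bit 0 = 0 requests C0.2; set it to 1 for C0.1
   umwait r8d                          ; sleep until a store to the line, the deadline,
                                       ; or another wake event (wakes may be spurious)
   jmp   .Wait                         ; re-arm and re-check before trying to lock
.Done:
   ret
```

Under these assumptions ReleaseLock can stay as in the question, since its plain store to the lock byte is the write that wakes a core waiting in umwait. The constant is kept in an immediate rather than in al across the loop because rdtsc overwrites eax and edx.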

0 Answers