In the Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel gives an example of a PAUSE-based spin-wait loop ("Example 11-4. Spin-wait Loop and PAUSE Instruction").
However, for the UMONITOR/UMWAIT instructions there are only mentions, without any example ("Support for user level low-power and low-latency spin-loop instructions UMWAIT/UMONITOR and TPAUSE", and "UMWAIT/UMONITOR/TPAUSE instructions enable power savings in user level spin loops"). These new instructions were introduced with the Tremont microarchitecture, which Intel released in September 2020 as part of the Elkhart Lake platform.
Can you please give a working example, in assembly language, of SpinWaitAndAquireLock and ReleaseLock functions that use a byte value as a synchronization variable (in either 64-bit or 32-bit addressing mode), so that this byte can be in only one of two states, i.e., contain just two values: cLockByteLocked (e.g., 0FFh) and cLockByteAvailable (e.g., 0)? As a reference example, here are such functions using the PAUSE instruction; they can be assembled with nasm -f win64:
section .text
; This code exposes two global functions: (1) SpinWaitAndAquireLock; (2) ReleaseLock.
cLockByteLocked equ 0FFh
cLockByteAvailable equ 0
; wait for the byte to not be locked and lock it
global SpinWaitAndAquireLock
; release lock obtained previously by SpinWaitAndAquireLock
global ReleaseLock
SpinWaitAndAquireLock:
; input: rcx - address of the synchronization variable lock byte
        mov     eax, cLockByteLocked    ; load the value that denotes "locked" into al;
                                        ; use eax to clear remaining bits to avoid false dependency
                                        ; on existing eax value, bits 32-63 of rax are also cleared
        jmp     @FirstCompare
@NormalLoadLoop:
        pause
@FirstCompare:
; use the "test, test-and-set" technique
        cmp     [rcx], al               ; try normal load first instead of the locked load - "test" is a first part of "test, test-and-set"
        je      @NormalLoadLoop         ; for static branch prediction, jump backward means "likely"
        lock xchg [rcx], al             ; since normal load shows the synchronization variable is available, try locked load and store ("test-and-set")
        cmp     al, cLockByteLocked
        je      @NormalLoadLoop
        ret

ReleaseLock:
; input: rcx - address of the synchronization variable lock byte
        mov     byte [rcx], cLockByteAvailable ; just use normal store
        ret
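In case the intent of the PAUSE reference is easier to follow outside of assembly, roughly the same "test, test-and-set" logic can be sketched in C11. This is only an illustration of the existing PAUSE version, not the UMONITOR/UMWAIT variant I am asking about. The names spin_wait_and_acquire_lock, release_lock, CPU_RELAX, LOCK_BYTE_LOCKED and LOCK_BYTE_AVAILABLE are made up for this sketch, and the non-x86 fallback for CPU_RELAX is an assumption (it simply does nothing).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>                 /* _mm_pause() emits the PAUSE instruction */
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)          /* assumption: no spin hint on other targets */
#endif

/* Hypothetical C counterparts of cLockByteAvailable / cLockByteLocked. */
enum { LOCK_BYTE_AVAILABLE = 0x00, LOCK_BYTE_LOCKED = 0xFF };

/* Hypothetical C counterpart of SpinWaitAndAquireLock. */
static void spin_wait_and_acquire_lock(_Atomic uint8_t *lock)
{
    assert(lock != NULL);
    for (;;) {
        /* "test": plain load; spin with PAUSE while the byte reads as locked */
        while (atomic_load_explicit(lock, memory_order_relaxed) == LOCK_BYTE_LOCKED)
            CPU_RELAX();

        /* "test-and-set": atomic exchange (XCHG on x86); the previous value
           tells us whether we took the lock or lost the race */
        if (atomic_exchange_explicit(lock, LOCK_BYTE_LOCKED,
                                     memory_order_acquire) != LOCK_BYTE_LOCKED)
            return;

        CPU_RELAX();                   /* lost the race: pause, then test again */
    }
}

/* Hypothetical C counterpart of ReleaseLock: a plain release store. */
static void release_lock(_Atomic uint8_t *lock)
{
    assert(lock != NULL);
    atomic_store_explicit(lock, LOCK_BYTE_AVAILABLE, memory_order_release);
}
```

A caller would declare the lock byte as `_Atomic uint8_t lock_byte = LOCK_BYTE_AVAILABLE;` and wrap the critical section between `spin_wait_and_acquire_lock(&lock_byte)` and `release_lock(&lock_byte)`. The inner load-only loop is the part that I would expect a UMONITOR/UMWAIT version to replace.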