
The following 64-bit NASM code uses lock cmpxchg to take each core in core order, execute some code, then update the core number variable with xchg so the next core can execute the code. Each core's number is stored in rbx -- the four cores are numbered 0, 8, 16 and 24. The variable [spin_lock_core] starts at zero, and when each core finishes it advances the core number by 8 with the final line xchg [spin_lock_core],rax.

Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899

mov rax,rbx
add rax,8
xchg [spin_lock_core],rax

But before the code reaches xchg [spin_lock_core],rax, the first core jumps out of the program (jmp label_899). That should cause the other threads to freeze, because they would be waiting for the [spin_lock_core] variable to be updated, which never happens. Instead, all four cores write to the output array extra_test_array, which is displayed on the terminal when the program exits. In other words, the lock fails to stop the cores until the core number is updated.
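For reference, cmpxchg uses rax as an implicit operand: it compares rax with the destination, and on a match stores the source register and sets ZF; on a mismatch it loads the destination's current value into rax and clears ZF (https://www.felixcloutier.com/x86/cmpxchg). A sketch of how the loop behaves without the xor rax,rax line (the line added in the UPDATE below), where rax is never re-initialized between retries:

Spin_lock:                          ; rax = whatever earlier code left in it
lock cmpxchg [spin_lock_core],rbx   ; mismatch: rax = [spin_lock_core], ZF = 0
jnz Spin_lock                       ; retry: rax now equals the memory value,
                                    ; so cmpxchg stores rbx and falls through

Under those semantics each core falls through after at most two iterations, regardless of whose turn it is.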

The full, minimal code is below (as minimal as NASM can be in this case). The code is written as a shared object, and the problem is reproducible as long as it receives an input array (as written, it doesn't matter whether the input array is int or float):

; Header Section
[BITS 64]

[default rel]

global Main_Entry_fn
extern pthread_create, pthread_join, pthread_exit, pthread_self, sched_getcpu
global FreeMem_fn
extern malloc, realloc, free
extern sprintf

section .data align=16
X_ctr: dq 0
data_master_ptr: dq 0
initial_dynamic_length: dq 0
XMM_Stack: dq 0, 0, 0, 0, 0, 0, 0
ThreadID: dq 0
X_ptr: dq 0
X_length: dq 0
X: dq 0
collect_ptr: dq 0
collect_length: dq 0
collect_ctr: dq 0
even_squares_list_ptrs: dq 0, 0, 0, 0
even_squares_list_ctr: dq 0
even_squares_list_length: dq 0
Number_Of_Cores: dq 32
pthread_attr_t: dq 0
pthread_arg: dq 0
Join_Ret_Val: dq 0
tcounter: dq 0
sched_getcpu_array: times 4 dq 0
ThreadIDLocked: dq 0
spin_lock_core: dq 0
extra_test_array: times 4 dq 0 ; one qword per core (rbx offsets 0, 8, 16, 24)

; __________

section .text

Init_Cores_fn:

; _____
; Create Threads

label_0:

mov rdi,ThreadID        ; pointer to pthread_t (thread ID out)
mov rsi,pthread_attr_t  ; Thread Attributes
mov rdx,Test_fn         ; Function Pointer
mov rcx,pthread_arg
call pthread_create wrt ..plt

mov rdi,[ThreadID]      ; id to wait on
mov rsi,Join_Ret_Val        ; return value
call pthread_join wrt ..plt

mov rax,[tcounter]
add rax,8
mov [tcounter],rax
mov rbx,[Number_Of_Cores]
cmp rax,rbx
jl label_0

; _____

jmp label_900 ; All threads return here, and exit

; ______________________________________

Test_fn:

; Get the core number
call sched_getcpu wrt ..plt
mov rbx,8               ; scale the core number by 8 (mul clobbers rdx)
mul rbx
mov rbx,rax             ; rbx = core number * 8
push rax                ; balanced by the pop at label_899

Spin_lock:
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
jmp label_899

mov rax,rbx
add rax,8
xchg [spin_lock_core],rax

;__________

label_899:

pop rax

ret

; __________

label_900:

mov rdi,extra_test_array ;audit_array
mov rax,rdi

ret

;__________
;Free the memory

FreeMem_fn:

;The pointer is passed back in rcx (of course)

sub rsp,40
call free wrt ..plt
add rsp,40
ret

; __________
; Main Entry


Main_Entry_fn:
push rdi
push rbp
push rbx
push r15
xor r15,r15
push r14
xor r14,r14
push r13
xor r13,r13
push r12
xor r12,r12
push r11
xor r11,r11
push r10
xor r10,r10
push r9
xor r9,r9
push r8
xor r8,r8
movsd [XMM_Stack+0],xmm13
movsd [XMM_Stack+8],xmm12
movsd [XMM_Stack+16],xmm11
movsd [XMM_Stack+24],xmm15
movsd [XMM_Stack+32],xmm14
movsd [XMM_Stack+40],xmm10
mov [X_ptr],rdi
mov [data_master_ptr],rsi
; Now assign lengths
lea rdi,[data_master_ptr]
mov rbp,[rdi]
xor rcx,rcx
movsd xmm0,qword[rbp+rcx]
cvttsd2si rax,xmm0
mov [X_length],rax
add rcx,8

; __________
; Write variables to assigned registers

mov r15,0
lea rdi,[rel collect_ptr]
mov r14,qword[rdi]
mov r13,[collect_ctr]
mov r12,[collect_length]
lea rdi,[rel X_ptr]
mov r11,qword[rdi]
mov r10,[X_length]

; __________

call Init_Cores_fn

movsd xmm13,[XMM_Stack+0]   ; restore from the same slots used at entry
movsd xmm12,[XMM_Stack+8]
movsd xmm11,[XMM_Stack+16]
movsd xmm15,[XMM_Stack+24]
movsd xmm14,[XMM_Stack+32]
movsd xmm10,[XMM_Stack+40]
pop r8
pop r9
pop r10
pop r11
pop r12
pop r13
pop r14
pop r15
pop rbx
pop rbp
pop rdi
ret

The instruction "lock cmpxchg" should fail until the [spin_lock_core] variable is updated, but it doesn't do that.

Thanks for any help in understanding why lock cmpxchg doesn't prevent the cores after core zero from firing in this area of code.

UPDATE: other research shows that xor rax,rax is needed at the top of the Spin_lock: section. When I insert that line, it reads like this:

Spin_lock:
xor rax,rax
lock cmpxchg [spin_lock_core],rbx
jnz Spin_lock

With that change it freezes, as expected. But when I remove the line jmp label_899 it still freezes, and it shouldn't.
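prl's comment below suggests the expected value should be the core's own number rather than 0. A sketch of that variant (assuming, as above, that rbx holds this core's number 0, 8, 16 or 24):

Spin_lock:
mov rax,rbx                         ; expected value: this core's number
lock cmpxchg [spin_lock_core],rbx   ; succeeds only when [spin_lock_core] == rbx
jnz Spin_lock

With that form, the closing xchg [spin_lock_core],rax (with rax = rbx + 8) is what releases the next core.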

EDIT 122219:

Based on the comments on this question yesterday, I revised the spinlock code to (1) eliminate atomic operations in favor of plain mov and cmp instructions, (2) assign a unique memory location to each core, and (3) separate the memory locations by more than 256 bytes so they don't share a cache line.

Each core's memory location will be changed to 1 when the previous core is finished. When each core finishes, it sets its own memory location back to 0.

The code successfully executes core 0 IF I have all other cores loop out before the spinlock. When I let all four cores run through the spinlock, the program again hangs.

I've verified that each separate memory location is set to 1 when the previous core is finished.

Here's the updated spinlock section:

section .data
spin_lock_core: times 140 dq 0
spin_lock_core_offsets: dq 0,264,528,792

section .text

; Calculate the offset to spin_lock_core
mov rbp,spin_lock_core
mov rdi,spin_lock_core_offsets
mov rax,[rdi+rbx]
add rbp,rax

; ________

Spin_lock:
pause
cmp byte[rbp],1
jnz Spin_lock

xor rax,rax
mov [rbp],rax ; Set current memory location to zero

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rdx
mov rcx,rax

; Loop out if this is the last core
mov rax,rbx
add rax,8
cmp rax,[Number_Of_Cores]
jge label_899

; Set next core to 1 by adding 264 to the base address
add rbp,264
mov rax,1
mov [rbp],rax

Why does this code still hang?

  • Read the manual: https://www.felixcloutier.com/x86/cmpxchg says that CMPXCHG has an implicit input (and output on failure): RAX for the "expected" value. You mention this at the bottom, but you didn't update the rest of your question. What do you think your program *should* be doing? Have you tried stopping one thread with a debugger and single-stepping? – Peter Cordes Dec 21 '19 at 20:39
  • I updated the source code above with xor rax,rax, and the line jmp label_899 is commented out. The program should execute core 0 while other cores wait, then core 0 updates using xchg [spin_lock_core],rax where rax is now 8 so core 8 will succeed at lock cmpxchg [spin_lock_core],rbx. Instead it freezes. – RTC222 Dec 21 '19 at 20:52
  • Perhaps there is a cache coherency problem - some research suggests cache issues with lock instructions and I'm investigating the atomics further. It's rather expensive so I may have to structure it differently -- for example, use a memory buffer with data for each core separated by at least 128 bytes and the buffer is updated for the next core as each core finishes its work. – RTC222 Dec 21 '19 at 20:52
  • I suspect that the code at the end needs an AND to prevent "spin_lock_core == 32" - e.g. `mov rax,rbx`, `add rax,8`, `and rax,24`, `xchg [spin_lock_core],rax`. The `xor rax,rax` when spinning is necessary. – Brendan Dec 21 '19 at 20:54
  • Don't jump to the idea that there is a cache coherency problem. The problem is undoubtedly with your program logic. If [spin_lock_core] starts out as 0, and all cores clear rax and then execute the cmpxchg, then the cmpxchg will succeed on all cores. I think you want to initialize rax with `mov rax, rbx` in the Spin_lock loop. – prl Dec 21 '19 at 20:58
  • If spin_lock_core gets to 32 it shouldn't matter because all cores are finished and will return. So and rax,24 just adds another instruction. – RTC222 Dec 21 '19 at 20:58
  • I don't see why you would want cmpxchg for this at all. All you need is `cmp [spin_lock_core], ebx` at the beginning, and `mov [spin_lock_core], rax` at the end. – prl Dec 21 '19 at 21:00
  • It was originally written that way, and it does make most sense, but it wasn't consistently reliable. That could mean cache coherency, and that's why I suggested a separate memory location for each core separated by 128+ bytes, otherwise operating as you said where each core gets updated when the previous core is finished. – RTC222 Dec 21 '19 at 21:03
  • @prl "the cmpxchg will succeed on all cores", is that possible? The instruction is an atomic RMW one, at most two cores should see 0 in spin_lock_core. If core 0 is the first, it re-stores 0 in spin_lock_core and another core can still take it - setting it to a non zero value and stopping other cores. If core 0 is not the first then only one core should be able to take the spinlock. Maybe the OP misunderstood the other core for all the other cores? – Margaret Bloom Dec 22 '19 at 13:12
  • @Margaret, yes that’s what I meant. I should have written “can succeed on any core”. – prl Dec 22 '19 at 21:31
  • Regarding the latest edit, I’m still concerned that you’re using the hardware thread id instead of the software thread id, but I haven’t seen anything that constrains the software threads to each run on a different hardware thread. (I admit I might have missed it.) – prl Dec 22 '19 at 21:31
  • As each thread runs on a separate core, I think the hardware thread should do it, and as I said yesterday the threads are always reliably 0, 8, 16 and 24. I am testing this on a cloud server and I'm about to test on a different cloud server at AWS to see if there could be some hardware issue. As written I don't see why it doesn't run. But as I'm running on a hypervisor it could be hardware vs software thread. – RTC222 Dec 22 '19 at 21:46

2 Answers


I don't think you should use cmpxchg for this at all. Try this:

Spin_lock:
pause
cmp [spin_lock_core],rbx
jnz Spin_lock

; Test
mov rbp,extra_test_array
mov [rbp+rbx],rbx

; Execute some code before looping out
mov rax,1234
mov rdx,23435
add rax,rbx
mov rcx,rax
;jmp label_899

lea rax,[rbx+8]
mov [spin_lock_core],rax
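If the hand-off ever needs to wrap around so core 0 can run again on a later pass, Brendan's comment on the question suggests masking the next core number; a sketch assuming the core numbers stay 0, 8, 16 and 24:

lea rax,[rbx+8]
and rax,24                 ; 32 wraps back around to 0
mov [spin_lock_core],rax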
  • I'll try that now and report back. – RTC222 Dec 21 '19 at 21:04
  • Your code does the same thing as my code -- the program hangs. I don't see why as written it should not work. I'm running it on an Intel Xeon Gold 6140 but that shouldn't matter. I suspect cache issues but I'm still not clear. – RTC222 Dec 21 '19 at 21:16
  • I'd like to know what sort of cache issues you are suspecting. Normally x86 is completely cache coherent and you don't have to worry about caching at all. – prl Dec 21 '19 at 21:22
  • Oh, I reread your description of the cache issues you have read about--that is totally concerned with performance, not with correctness. Having all your threads spin-locked on the same address causes a lot of cache coherency traffic, hence a performance issue, but it doesn't cause problems with your code working correctly (as long as the program logic is correct). Note, using cmp instead of cmpxchg alleviates this problem, because with cmp, all the cores can hold the line in shared state until it changes, whereas with cmpxchg, every core has to acquire the line exclusively on every loop. – prl Dec 21 '19 at 21:27
  • Are you sure that your thread id's really differ by 8? We know that they are 8 x the CPU number, but I don't know if we can assume that the CPU numbers that the 4 threads are running on are 0, 1, 2, and 3. – prl Dec 21 '19 at 21:36
  • In all the work I've done with this code the cores are reliably 0. 8, 16 and 24. But your comment makes me think I should consider a different approach to the core numbering. I'm doing some more debugging work with this. Performance is also an issue so I'm trying a different approach using cmp and mov instructions instead and I'll update later with my results. Thanks for your input. – RTC222 Dec 21 '19 at 21:48
  • Generally you need an atomic RMW in the acquire side of a lock. If you don't have that, maybe you're just handing on ownership to one single next thread. Then it's not a spinlock anymore; it's not open for any arbitrary thread to take it when it's unlocked. But this might be what @RTC222 is trying to actually do. – Peter Cordes Dec 21 '19 at 22:58
  • @Peter, agreed that he would need an atomic rmw if he wanted to allow *any* thread to go next, but since he is trying to signal a *specific* thread to go next, he doesn't need it. – prl Dec 21 '19 at 23:02
  • Ok. I haven't really been reading this question/answer or comments, just brief skimming. As long as we're clear that this isn't a spinlock anymore, but rather a handoff sequence. IMO change the label name. (But then we need a name... maybe `try_acquire`?) – Peter Cordes Dec 21 '19 at 23:04
  • Yes, we are accessing the lock in core order. I tested this with a write-to-file (for debugging) immediately below jnz Spin_lock after entering the spinlock on core 0. The program froze and no data were written, not even for core 0 which comes first. I believe that with all four cores spinning and reading on the same memory location (at high speed) causes the problem. The solution I'm working on is to use a separate memory location for each core, and perhaps eliminate atomics due to their latency. I'll post the results here later. – RTC222 Dec 21 '19 at 23:14
  • but I will use xchg to update the core number due to this comment regarding simple write operations: "Cores store to their write buffers and continue executing further instructions before the previous writes actually reach the cache. On the contrary, atomic operations cause the write buffers to be drained. That means that every atomic is affecting the cache directly without being merged or buffered." (Evaluating the Cost of Atomic Operations on Modern Architectures at https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf). – RTC222 Dec 21 '19 at 23:19
  • If you need thread 0 to wait until thread 1 has started executing the protected section, you need more than simply an xchg instruction to achieve that. If you don't need thread 0 to wait, then a mov instruction is sufficient. Using xchg won't accomplish anything. – prl Dec 21 '19 at 23:23
  • I don't need thread 0 to wait but with xchg or with a mov instruction the other cores continue to spin. My theory, as I said above, is that they are spinning on the same memory location. – RTC222 Dec 21 '19 at 23:35
  • I updated this question today based on work I did after these comments yesterday. Any comments on that edit will be very much appreciated. Thanks. – RTC222 Dec 22 '19 at 21:09

I thought I had solved this spinlock problem, but after Peter Cordes' comment below I see that it is not correct. I won't delete this answer because I hope it can lead to the solution.

I used lock cmpxchg [rbp+rbx],rbx, which assembles without error -- and correctly so: the format is CMPXCHG r/m64,r64, and the explicit source operand can be any r64 (see https://www.felixcloutier.com/x86/cmpxchg). What I originally missed is that rax is an implicit operand: the instruction compares rax with the destination, and on failure it loads the destination's current value into rax.

Without the "mov rax,rbx" line it appears to work, because on the first iteration rax happens to be 0, which matches the memory location -- and even if the compare fails, cmpxchg loads the current memory value into rax, so the second iteration succeeds by default.

When I add "mov rax,rbx" -- which resets rax on every iteration -- the program once again hangs. I would really appreciate any ideas on why it hangs as written.

At the start of this block rbx is the core number:

section .data
spin_lock_core: times 4 dq 0

section .text

[ Code leading up to this spinlock section shown above ]

mov rbp,spin_lock_core

Spin_lock:
pause
mov rax,rbx
lock cmpxchg [rbp+rbx],rax
jnz Spin_lock

mov rax,rbx
add rax,8
cmp rax,[Number_Of_Cores]
jge spin_lock_out

xchg [rbp+rax],rax

spin_lock_out:

The differences from my original post are:

  1. Each core spins on (and reads from) its own unique memory location.

  2. I use the "pause" instruction on the spinlock.

  3. Each unique memory location is updated in core order.

But it does not work when I include mov rax,rbx. Intuitively that should work, so I would really appreciate any ideas on why it doesn't in this case.

  • Huh, what pre-conditions are there for calling this block of code? It doesn't set RAX before `lock cmpxchg`. So if it doesn't match the old contents, it loads and then retries, and will just succeed. I don't see what the point of using `lock cmpxchg` at all was if you don't actually care about the old contents; just do a store. (Or `xchg` as a sequential-consistency store). – Peter Cordes Dec 23 '19 at 03:13
  • You are right. I revised my answer (it's more of a question now) to show what I did wrong. It hangs when I add the line mov rax,rbx but I don't understand why. – RTC222 Dec 23 '19 at 17:31
  • I ran strace on this process. The final line before it hangs is: "rt_sigaction(SIGINT, {sa_handler=0x55e71359f160, sa_mask=[], sa_flags=SA_RESTORE R, sa_restorer=0x7f17491d8f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f17491d8f20}, 8) = 0 wait4(-1," -- so it's waiting on sa_restorer. – RTC222 Dec 23 '19 at 19:35
  • According to https://stackoverflow.com/questions/42356368/memory-access-error-sys-rt-sigaction-signal-handler, "In x86-64 linux, it's mandatory to supply a sa_restorer." It looks like flags need to be reset but I am still investigating. – RTC222 Dec 23 '19 at 19:35
  • I'm debugging this with gdb. I'll post back with my results later. – RTC222 Dec 24 '19 at 00:26