I program in C++ and use CAS (compare-and-swap) operations for thread synchronization.
I profiled my program with VTune and found that a large portion of the time was spent on the CAS operation.
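For context, a CAS-based update in C++ typically looks something like the sketch below. It assumes `std::atomic` and `compare_exchange_weak`; the names `counter` and `add_with_cas` are hypothetical and are not taken from the profiled program.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared counter; not from the profiled program.
std::atomic<std::int64_t> counter{0};

// Adds delta to the counter with a CAS retry loop and returns the new value.
std::int64_t add_with_cas(std::int64_t delta) {
    std::int64_t expected = counter.load(std::memory_order_relaxed);
    // Retry until the CAS succeeds; on failure, `expected` is refreshed
    // with the value currently stored in `counter`.
    while (!counter.compare_exchange_weak(expected, expected + delta)) {
    }
    return expected + delta;
}
```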
I took a look at the assembly code.
The profiling result shows that a significant portion of the time is attributed to 'movq %rax, (%rsi)', not to 'lock cmpxchgq %rcx, (%rdi)'.
How is the 'movq %rax, (%rsi)' instruction related to the CAS operation? Which data is being moved by this instruction?