
This doesn't seem to be right, although I am unsure why. Advice would be great, as the documentation for CMPXCHG16B is pretty minimal (I don't own any Intel manuals...).

template<>
inline bool cas(volatile types::uint128_t *src, types::uint128_t cmp, types::uint128_t with)
{
    /*
    Description:
     The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers 
     with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, 
     and the RCX:RBX value is copied to the memory location. 
     Otherwise, the ZF flag is cleared, and the memory value is copied to RDX:RAX.
     */
    uint64_t * cmpP = (uint64_t*)&cmp;
    uint64_t * withP = (uint64_t*)&with;
    unsigned char result = 0;
    __asm__ __volatile__ (
    "LOCK; CMPXCHG16B %1\n\t"
    "SETZ %b0\n\t"
    : "=q"(result)  /* output */ 
    : "m"(*src), /* input */
      //what to compare against
      "rax"( ((uint64_t) (cmpP[1])) ), //lower bits
      "rdx"( ((uint64_t) (cmpP[0])) ),//upper bits
      //what to replace it with if it was equal
      "rbx"( ((uint64_t) (withP[1])) ), //lower bits
      "rcx"( ((uint64_t) (withP[0]) ) )//upper bits
    : "memory", "cc", "rax", "rdx", "rbx","rcx" /* clobbered items */
    );
    return result;
}

When running with an example I am getting 0 when it should be 1. Any ideas?

JH.

5 Answers


Noticed a few issues:

(1) The main problem is the constraints: "rax" doesn't do what it looks like. Each character in the constraint string is a separate alternative, so the "r" lets gcc use any register.

(2) Not sure how you're storing types::uint128_t, but assuming the standard little-endian layout for x86 platforms, the high and low 64-bit halves are also swapped around.

(3) Taking the address of something and casting it to something else can break aliasing rules. Whether this is an issue depends on how your types::uint128_t is defined (it's fine if it is a struct of two uint64_t's). GCC with -O2 will optimize assuming aliasing rules are not violated.

(4) *src should really be marked as an output rather than specifying a memory clobber, but this is more of a performance issue than a correctness one. Similarly, rbx and rcx do not need to be specified as clobbered.

Here is a version that works:

#include <stdint.h>

namespace types
{
    // alternative: union with  unsigned __int128
    struct uint128_t
    {
        uint64_t lo;
        uint64_t hi;
    }
    __attribute__ (( __aligned__( 16 ) ));
}

template< class T > inline bool cas( volatile T * src, T cmp, T with );

template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp, types::uint128_t with )
{
    // cmp can be by reference so the caller's value is updated on failure.

    // suggestion: use __sync_bool_compare_and_swap and compile with -mcx16 instead of inline asm
    bool result;
    __asm__ __volatile__
    (
        "lock cmpxchg16b %1\n\t"
        "setz %0"       // on gcc6 and later, use a flag output constraint instead
        : "=q" ( result )
        , "+m" ( *src )
        , "+d" ( cmp.hi )
        , "+a" ( cmp.lo )
        : "c" ( with.hi )
        , "b" ( with.lo )
        : "cc", "memory" // compile-time memory barrier.  Omit if you want memory_order_relaxed compile-time ordering.
    );
    return result;
}

int main()
{
    using namespace types;
    uint128_t test = { 0xdecafbad, 0xfeedbeef };
    uint128_t cmp = test;
    uint128_t with = { 0x55555555, 0xaaaaaaaa };
    return ! cas( & test, cmp, with );
}
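
As the comments in the asm above note, gcc 6 and later support flag output constraints, which let you drop the setz entirely. Here is a minimal sketch of that variant under the same struct definition (the name cas_z is only illustrative; "=@ccz" asks GCC for the ZF result directly, and the "cc" clobber is omitted since the flag result is already described by the output):

inline bool cas_z( volatile types::uint128_t * src, types::uint128_t cmp, types::uint128_t with )
{
    bool result;
    __asm__ __volatile__
    (
        "lock cmpxchg16b %1"       // no setz: the flag output below captures ZF
        : "=@ccz" ( result )
        , "+m" ( *src )
        , "+d" ( cmp.hi )
        , "+a" ( cmp.lo )
        : "c" ( with.hi )
        , "b" ( with.lo )
        : "memory"                 // compile-time barrier as before
    );
    return result;
}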
luke h
  • I copied and pasted your code, and when compiling with "g++-4.7 -g -DDEBUG=1 -std=c++0x -pthread dwcas.c -o dwcas.o -ldl -lpthread" I get dwcas.c:29: Error: junk `ptr ' after expression. Any ideas why? – Steven Feldman Sep 17 '13 at 02:26
  • It should just be `lock cmpxchg16b %1` . The size in this case isn't needed since it is implied by the instruction `cmpxchg16b` . The use of `oword ptr` would suggest to me you were thinking this was _MASM_ assembler which will not assemble with GNU assembler. – Michael Petch Jun 09 '16 at 19:09
  • I know this question is old, but there are things people should consider before using this answer. You should *seriously* [consider](https://gcc.gnu.org/wiki/DontUseInlineAsm) avoiding inline asm. Among other reasons, it's *really* hard to get right. As an example, this code almost certainly needs a "memory" clobber. Not to ensure the correct functioning of `cmpxchg16b`, but because the goal is probably to sync threads, which means you want all registers flushed *before* notifying the other thread. Consider the `__sync*` answer below, std::atomic or any other solution before going with asm. – David Wohlferd Jun 10 '16 at 02:55
  • gcc6 and later would let you use a flag output operand. – Peter Cordes Jan 17 '19 at 21:54

All Intel documentation is available for free: Intel® 64 and IA-32 Architectures Software Developer's Manuals.

Tomek

It's good to note that if you're using GCC, you don't need to use inline asm to get at this instruction. You can use one of the __sync functions, like:

template<>
inline bool cas(volatile types::uint128_t *src,
                types::uint128_t cmp,
                types::uint128_t with)
{
    return __sync_bool_compare_and_swap(src, cmp, with);
}

Microsoft has a similar function for VC++:

__int64 exchhi = __int64(with >> 64);
__int64 exchlo = (__int64)(with);

return _InterlockedCompareExchange128(a, exchhi, exchlo, &cmp) != 0;
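
For context, here is a minimal self-contained sketch of the MSVC route. The two-qword struct and the cas wrapper are illustrative, not part of Microsoft's API; _InterlockedCompareExchange128 returns 1 on success and always writes the original value of the destination back through the ComparandResult pointer.

#include <intrin.h>

struct alignas(16) uint128_t { __int64 lo; __int64 hi; };

inline bool cas(volatile uint128_t *src, uint128_t &cmp, uint128_t with)
{
    // The destination must be 16-byte aligned; the original value of *src
    // is written back into cmp via the ComparandResult parameter.
    return _InterlockedCompareExchange128(
        reinterpret_cast<volatile __int64 *>(src),
        with.hi, with.lo,
        reinterpret_cast<__int64 *>(&cmp)) != 0;
}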
Cory Nelson
  • Interesting, gcc7 and later changed to always calling libatomic for `__atomic_compare_exchange` for 16-byte CAS, but the deprecated `__sync` builtin will still inline `lock cmpxchg16b` if you compile with `-mcx16`. (It's not baseline for x86-64; unfortunately early K8 CPUs didn't have it.). https://godbolt.org/z/4bbfm6 has this and a version using `__atomic`, and std::atomic. `__sync_bool_compare_and_swap` doesn't update `cmp` if the compare fails, unlike ISO C++11 `std::atomic::compare_exchange_weak` (or `..._strong`, they're the same difference on x86-64). – Peter Cordes Jan 17 '19 at 21:51

Here are some alternatives compared:

  1. Inline assembly, such as @luke h's answer.

  2. __sync_bool_compare_and_swap(): a deprecated, gcc/clang/ICC-only GNU extension; a pseudo-function for which the compiler will emit a CMPXCHG16B instruction, at least when compiled with -mcx16.

  3. atomic_compare_exchange_weak() / strong: the C11 pseudo-function that does what atomic<> does in C++11. With gcc 7 and later this will NOT emit a CMPXCHG16B inline, but instead calls libatomic (which must therefore be linked in). Dynamically linked libatomic decides which version of its functions to use based on what the CPU is capable of, and on machines where the CPU supports CMPXCHG16B it will use that.

  4. Apparently clang will still inline CMPXCHG16B for atomic_compare_exchange_weak() or strong.

I haven't tried the hand-written assembly, but looking at the disassembly of (2), it looks perfect and I don't see how (1) could beat it. (I have little knowledge of x86, but programmed a lot of 6502.) Further, there's ample advice never to use assembly if you can avoid it, and it can be avoided at least with gcc/clang. So I can cross (1) off the list.
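
For reference, here is roughly what options (2) and (3) look like in source form. This is a minimal sketch assuming g++/clang++ on x86-64, compiled with -mcx16 and linked with -latomic for the std::atomic form; the wrapper names are only illustrative.

#include <atomic>

using u128 = unsigned __int128;

// Option (2): the deprecated __sync builtin; gcc inlines lock cmpxchg16b with -mcx16.
inline bool cas_sync(volatile u128 *p, u128 expected, u128 desired)
{
    return __sync_bool_compare_and_swap(p, expected, desired);
}

// Option (3): std::atomic, the C++ counterpart of C11 atomic_compare_exchange_weak;
// gcc 7 and later emit a call into libatomic here instead of inlining.
inline bool cas_atomic(std::atomic<u128> *p, u128 &expected, u128 desired)
{
    return p->compare_exchange_weak(expected, desired);
}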

Here's the code for (2) in gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC):

Thread 2 "mybinary" hit Breakpoint 1, MyFunc() at myfile.c:586
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407262 <MyFunc+904>:       49 89 c2        mov    %rax,%r10
   0x0000000000407265 <MyFunc+907>:       49 89 d3        mov    %rdx,%r11
(gdb) n
587                                                  was, new ) ) {
=> 0x0000000000407268 <MyFunc+910>:       48 8b 45 a0     mov    -0x60(%rbp),%rax
   0x000000000040726c <MyFunc+914>:       48 8b 55 a8     mov    -0x58(%rbp),%rdx
(gdb) n
586               if ( __sync_bool_compare_and_swap( &myvar,
=> 0x0000000000407270 <MyFunc+918>:       48 c7 c6 00 d3 42 00    mov    $0x42d300,%rsi
   0x0000000000407277 <MyFunc+925>:       4c 89 d3        mov    %r10,%rbx
   0x000000000040727a <MyFunc+928>:       4c 89 d9        mov    %r11,%rcx
   0x000000000040727d <MyFunc+931>:       f0 48 0f c7 8e 70 04 00 00      lock cmpxchg16b 0x470(%rsi)
   0x0000000000407286 <MyFunc+940>:       0f 94 c0        sete   %al

Then doing hammer tests of (2) and (3) on real-world algos, I am seeing no real performance difference. Even in theory, (3) only has the overhead of one additional function call and some work in the libatomic wrapper function, including a branch on whether the CAS succeeded or not.

(With lazy dynamic linking, the first call into the libatomic function will actually run an init function that uses CPUID to check if your CPU has cmpxchg16b. Then it will update the GOT function-pointer that the PLT stub jumps through, so future calls will go straight to libat_compare_exchange_16_i1 that uses lock cmpxchg16b. The i1 suffix in the name is from GCC's ifunc mechanism for function multi-versioning; if you ran it on a CPU without cmpxchg16b support, it would resolve the shared library function to a version that used locking.)

In my real-world-ish hammer tests, that function-call overhead is lost in the amount of CPU taken by the functionality the lock-free mechanism is protecting. Therefore, I don't see a reason to use the __sync functions, which are compiler-specific and deprecated to boot.

Here's the assembly for the libatomic wrapper that gets called for each .compare_exchange_weak(), from single-stepping through the assembly on my Fedora 31. If compiled with -fno-plt, a callq *__atomic_compare_exchange_16@GOTPCREL(%rip) will get inlined into the caller, avoiding the PLT and running the CPU-detection early, at program startup time instead of on the first call.

Thread 2 "tsquark" hit Breakpoint 2, 0x0000000000403210 in 
__atomic_compare_exchange_16@plt ()
=> 0x0000000000403210 <__atomic_compare_exchange_16@plt+0>:     ff 25 f2 8e 02 00       jmpq   *0x28ef2(%rip)        # 0x42c108 <__atomic_compare_exchange_16@got.plt>
(gdb) disas
Dump of assembler code for function __atomic_compare_exchange_16@plt:
=> 0x0000000000403210 <+0>:     jmpq   *0x28ef2(%rip)        # 0x42c108 <__atomic_compare_exchange_16@got.plt>
   0x0000000000403216 <+6>:     pushq  $0x1e
   0x000000000040321b <+11>:    jmpq   0x403020
End of assembler dump.
(gdb) s
Single stepping until exit from function __atomic_compare_exchange_16@plt,
...

0x00007ffff7fab250 in libat_compare_exchange_16_i1 () from /lib64/libatomic.so.1
=> 0x00007ffff7fab250 <libat_compare_exchange_16_i1+0>: f3 0f 1e fa     endbr64
(gdb) disas
Dump of assembler code for function libat_compare_exchange_16_i1:
=> 0x00007ffff7fab250 <+0>:     endbr64
   0x00007ffff7fab254 <+4>:     mov    (%rsi),%r8
   0x00007ffff7fab257 <+7>:     mov    0x8(%rsi),%r9
   0x00007ffff7fab25b <+11>:    push   %rbx
   0x00007ffff7fab25c <+12>:    mov    %rdx,%rbx
   0x00007ffff7fab25f <+15>:    mov    %r8,%rax
   0x00007ffff7fab262 <+18>:    mov    %r9,%rdx
   0x00007ffff7fab265 <+21>:    lock cmpxchg16b (%rdi)
   0x00007ffff7fab26a <+26>:    mov    %r9,%rcx
   0x00007ffff7fab26d <+29>:    xor    %rax,%r8
   0x00007ffff7fab270 <+32>:    mov    $0x1,%r9d
   0x00007ffff7fab276 <+38>:    xor    %rdx,%rcx
   0x00007ffff7fab279 <+41>:    or     %r8,%rcx
   0x00007ffff7fab27c <+44>:    je     0x7ffff7fab288 <libat_compare_exchange_16_i1+56>
   0x00007ffff7fab27e <+46>:    mov    %rax,(%rsi)
   0x00007ffff7fab281 <+49>:    xor    %r9d,%r9d
   0x00007ffff7fab284 <+52>:    mov    %rdx,0x8(%rsi)
   0x00007ffff7fab288 <+56>:    mov    %r9d,%eax
   0x00007ffff7fab28b <+59>:    pop    %rbx
   0x00007ffff7fab28c <+60>:    retq
End of assembler dump.

The only benefit I've found to using (2) is if your machine didn't ship with a libatomic (true of older Red Hat releases) and you can't ask the sysadmins to provide it, or don't want to count on them installing the right one. I personally downloaded the source and built it incorrectly, so that 16-byte swaps ended up using a mutex: disaster.

I haven't tried (4). Or rather, I started to, but there were so many warnings/errors on code that gcc passed without comment that I wasn't able to get it compiled in the budgeted time.

Note that while options 2, 3, and 4 seem like the same code, or nearly the same code, should work, in fact all three have substantially different checks and warnings, and even if one of the three compiles fine and without warnings under -Wall, you may get a lot more warnings or errors if you try one of the others. The __sync* pseudo-functions aren't well documented. (In fact the documentation only mentions 1/2/4/8 bytes, not that they work for 16 bytes. They work "kind of" like function templates, but you don't see the template, and they seem finicky about whether the first and second argument types actually are the same type in a way that the atomic_* functions aren't.) In short, it wasn't the 3-minute job to compare 2, 3, and 4 that you might guess.

Swiss Frank
  • Function-call overhead is more significant in the *uncontended* case when the same thread is doing atomic RMWs on something that's already hot in its L1d cache in Exclusive or Modified state. A full memory barrier and the microcoded `lock cmpxchg16b` is still slow compared to some overhead in the libatomic function but I could certainly imagine a careless use of `atomic<__uint128_t>` or struct load or store, or atomic RMW retry loop, being measurably slower when gcc >= 7 doesn't inline. – Peter Cordes May 13 '20 at 02:01
  • BTW, a `libatomic` that does know how to use `lock cmpxchg16b` doesn't use it unconditionally, AFAIK. I think it checks CPUID at dynamic link time, using the same mechanism that glibc uses to pick optimized versions of `memset` / `memcpy` etc. based on CPU capabilities. `objdump -drwC -Mintel /usr/lib/libatomic.so` shows there's `libat_compare_exchange_16` using a lock vs. `libat_compare_exchange_16_i1` using cx16. I edited the bullet point in your answer but not the later part. – Peter Cordes May 13 '20 at 02:03
  • Thx as always @PeterCordes and I've added the assembly for (2) and (3), which shows the CPUID feature your comment mentions. We were editing same time so I backed off to read your edits first before redoing mine. – Swiss Frank May 13 '20 at 02:23
  • re: edit: The function that uses `cpuid` runs once, the first time the function is called (or at early-binding time if you compile with `-fno-plt` instead of the default PLT + lazy dynamic linking). That's the custom resolver function I was talking about, same as for `memset` or whatever. It updates the GOT entry to point at the selected version of the function. Future calls go straight to the `cmpxchg16b` version. `cpuid` is *very* slow compared to an uncontended CAS (~100 to 200 cycles and fully serializing), definitely something you need to avoid in the fast path. – Peter Cordes May 13 '20 at 02:26
  • _It updates the GOT entry to point at the selected version of the function_ Interesting! I don't know enough x86 to see that in the code but sounds like they made it as good as they could. Of course my _first_ preference would be that they simply generate the CMPXCHG16B inline, if the `-m` architecture allows, and I don't know why they don't do that. Any ideas? Granted my hammer test shows it really is just an academic difference, but academic differences can add up. – Swiss Frank May 13 '20 at 02:34
  • I haven't tried to follow the rats nest of lazy dynamic linker code; it's at least 10k instructions IIRC. What you can do is step into the *second* call to the function and see that the PLT jump goes straight to the real function. – Peter Cordes May 13 '20 at 02:42
  • Not inlining `lock cmpxchg16b` is a side effect of not wanting to report 16-byte atomic objects as "lock free", because that implies performance that isn't there (e.g. reads via `lock cmpxchg16b` contend with each other, but we expect lock-free atomics to have near-perfect read side scaling). Also it segfaults on read-only memory. So it's kind of a hack, and GCC maybe also wanted to open up the possibility of changing it at some point in the future. Not having it inlined anywhere means that libatomic can change without breaking ABI compat. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563 – Peter Cordes May 13 '20 at 02:44
  • This answer is still super misleading about the normal `atomic_compare_exchange_weak` way. It implies it runs two `cpuid` instructions per `lock cmpxchg16` which would be a disaster. Please fix when you have time to single-step into the 2nd or later call and see the actual definition that gets used, or for now delete most of that section. There's overhead even including a conditional branch because stupid C can't return more than one value by-value, and the function takes an arg by reference in memory. – Peter Cordes May 13 '20 at 05:03
  • I timed a single thread in a tight loop doing uncontended CAS on an `__int128` or `atomic<__int128>`, on my Skylake i7-6700k with GCC9.3. `__sync_bool_compare_and_swap` took about 263M cycles for 10M CASes, while `a128.compare_exchange_weak(a, 123);` took about 370M cycles for the same 10M CASes. Other 128-bit atomics like `.exchange()` are much worse, using mfence before and after a `lock cmpxchg16b` retry loop in libatomic, significantly hurting performance for the uncontended case (like 3x slower than `__sync`). – Peter Cordes May 13 '20 at 14:12
  • _26.3 cycles vs. 37.0 cycles_. So let's say 30% faster? I think that might be significant for some users. For me it's really not. If the code being protected by this CAS takes 1000 cycles (including cache miss delays for data and instructions), then 1% improvement for this code, but 0.0% for app. I'm under pressure to move on, and sure you are too, but I bet many would be curious for apples-to-apples comparisons on your above numbers using the in-line assembly CAS (above) and the _non_-CMPXCHG16B implementation of `__atomic`. (GNU 4.8.6 had an atomic library for download like that.) – Swiss Frank May 13 '20 at 16:01
  • Yeah, pretty significant for code that did multiple atomic modifications to some lock-free data structure, so the uncontended case was common. I assume the inline asm could get optimized into about the same as what GCC emits for `__sync_bool_compare_and_swap` or `__sync_val_compare_and_swap`. I doubt it's worth using inline asm unless you want both `val` and `bool` results from the same CAS. (checking for a change to `expected` leads to a few wasted instructions. Pretty minor.) – Peter Cordes May 13 '20 at 16:07
  • I edited your answer to correct a mistake: Lazy dynamic linking doesn't rewrite *the jump itself*, it rewrites the function pointer used by that indirect jump in the PLT. This is how lazy dynamic linking always works, except that ifunc makes it run a custom resolver function to pick based on CPUID. Same as glibc does for memset / memcpy / strchr / strlen / etc. I don't think it's interesting or useful to show the asm for the resolver function; those "dozen instructions" are a tiny fraction of the total executed by dynamic linker code to find the resolver function. – Peter Cordes May 13 '20 at 16:29
  • _pretty significant for code that did multiple atomic modifications to some lock-free data structure_ The simplest case I've come up with is using it for a "allocate but never free" memory pool, where the CAS is to store the result of like 5 lines of C. In that case a 50% slower CAS makes the allocation algo as a whole maybe 20% slower, but the app as a whole 0.00% slower. But even that algo was only of academic interest as an alternate implementation with thread_local pools doesn't even need CAS, and avoids false sharing too. Surely _someone_ will find `__sync` compelling but not me, yet. – Swiss Frank May 14 '20 at 09:29

I got it compiling with g++ after a slight change (removing oword ptr from the cmpxchg16b instruction). But it doesn't seem to overwrite the memory as required, though I may be wrong. [See update] Code is given below, followed by the output.

#include <stdint.h>
#include <stdio.h>

namespace types
{
  struct uint128_t
  {
    uint64_t lo;
    uint64_t hi;
  }
  __attribute__ (( __aligned__( 16 ) ));
 }

 template< class T > inline bool cas( volatile T * src, T cmp, T with );

 template<> inline bool cas( volatile types::uint128_t * src, types::uint128_t cmp,  types::uint128_t with )
 {
   bool result;
   __asm__ __volatile__
   (
    "lock cmpxchg16b %1\n\t"
    "setz %0"
    : "=q" ( result )
    , "+m" ( *src )
    , "+d" ( cmp.hi )
    , "+a" ( cmp.lo )
    : "c" ( with.hi )
    , "b" ( with.lo )
    : "cc"
   );
   return result;
}

void print_dlong(char* address) {

  char* byte_array = address;
  int i = 0;
  while (i < 4) {
     printf("%02X",(int)byte_array[i]);
     i++;
  }

  printf("\n");
  printf("\n");

}

int main()
{
  using namespace types;
  uint128_t test = { 0xdecafbad, 0xfeedbeef };
  uint128_t cmp = test;
  uint128_t with = { 0x55555555, 0xaaaaaaaa };

  print_dlong((char*)&test);
  bool result = cas( & test, cmp, with );
  print_dlong((char*)&test);

  return result;
}

Output

FFFFFFADFFFFFFFBFFFFFFCAFFFFFFDE


55555555

I'm not sure the output makes sense to me. I was expecting the before value to be something like 00000000decafbad00000000feedbeef according to the struct definition, but the bytes seem to be spread out within words. Is that due to the aligned directive? Btw, the CAS operation does seem to return the correct value. Any help deciphering this?

Update: I just did some debugging with memory inspection in gdb, and the correct values are shown there. So I guess this must be a problem with my print_dlong procedure. Feel free to correct it. I am leaving this reply as it is to be corrected, since a corrected version would be instructive of the cas operation with printed results.
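
For what it's worth, a corrected sketch of the print routine is below: the original casts each byte through a (signed) char, so any byte with the high bit set is sign-extended and printed as FFFFFFxx, and it only prints 4 of the 16 bytes. With this version, the first call should print ADFBCADE00000000EFBEEDFE00000000, i.e. the little-endian bytes of { lo = 0xdecafbad, hi = 0xfeedbeef }.

#include <stdio.h>

// Prints all 16 bytes at `address`, least-significant byte first (x86 is little-endian).
void print_dlong(const void *address)
{
    const unsigned char *bytes = (const unsigned char *)address;
    for (int i = 0; i < 16; i++)
        printf("%02X", bytes[i]);
    printf("\n\n");
}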

chamibuddhika