
Summary: I had expected that std::atomic<int*>::load with std::memory_order_relaxed would be close to the performance of just loading a pointer directly, at least when the loaded value rarely changes. I saw far worse performance for the atomic load than a normal load on Visual Studio C++ 2012, so I decided to investigate. It turns out that the atomic load is implemented as a compare-and-swap loop, which I suspect is not the fastest possible implementation.

Question: Is there some reason that std::atomic<int*>::load needs to do a compare-and-swap loop?

Background: I believe that MSVC++ 2012 is doing a compare-and-swap loop on atomic load of a pointer based on this test program:

#include <atomic>
#include <iostream>

template<class T>
__declspec(noinline) T loadRelaxed(const std::atomic<T>& t) {
  return t.load(std::memory_order_relaxed);
}

int main() {
  int i = 42;
  char c = 42;
  std::atomic<int*> ptr(&i);
  std::atomic<int> integer(i);      // initialized: a default-constructed std::atomic holds an indeterminate value
  std::atomic<char> character(c);
  std::cout
    << *loadRelaxed(ptr) << ' '
    << loadRelaxed(integer) << ' '
    << loadRelaxed(character) << std::endl;
  return 0;
}

I'm using a __declspec(noinline) function in order to isolate the assembly instructions related to the atomic load. I created a new MSVC++ 2012 project, added an x64 platform, selected the release configuration, ran the program in the debugger, and looked at the disassembly. It turns out that both the std::atomic<char> and std::atomic<int> arguments end up calling the same loadRelaxed<int> instantiation - this must be something the optimizer did. Here is the disassembly of the two loadRelaxed instantiations that get called:

loadRelaxed<int * __ptr64>

000000013F4B1790  prefetchw   [rcx]  
000000013F4B1793  mov         rax,qword ptr [rcx]  
000000013F4B1796  mov         rdx,rax  
000000013F4B1799  lock cmpxchg qword ptr [rcx],rdx  
000000013F4B179E  jne         loadRelaxed<int * __ptr64>+6h (013F4B1796h)  

loadRelaxed<int>

000000013F3F1940  prefetchw   [rcx]  
000000013F3F1943  mov         eax,dword ptr [rcx]  
000000013F3F1945  mov         edx,eax  
000000013F3F1947  lock cmpxchg dword ptr [rcx],edx  
000000013F3F194B  jne         loadRelaxed<int>+5h (013F3F1945h)  

The lock cmpxchg instruction is an atomic compare-and-swap, and we can see here that the code for atomically loading a char, an int, or an int* is a compare-and-swap loop. I also built this code for 32-bit x86, and that implementation is still based on lock cmpxchg.
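For reference, here is a C++-level sketch (my reconstruction, not MSVC's library source) of the pattern the generated code corresponds to: the value is read once, and then lock cmpxchg keeps swapping the current value with itself until it succeeds, which is exactly the mov / mov / lock cmpxchg / jne sequence above.

#include <atomic>

// Sketch of what the emitted loop amounts to (not the actual library code):
// read the current value, then CAS it against itself until the CAS succeeds.
template<class T>
T loadViaCasLoop(std::atomic<T>& t) {
  T expected = t.load(std::memory_order_relaxed);   // mov rax, qword ptr [rcx]
  // lock cmpxchg stores back the same value it compared against; on failure
  // 'expected' is refreshed with the value currently in memory and we retry (jne).
  while (!t.compare_exchange_weak(expected, expected,
                                  std::memory_order_relaxed)) {
  }
  return expected;
}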


  • I would also like to see why this kind of code is generated – James Feb 06 '13 at 14:15
  • @James I suspect that MS just has not yet had the time to implement this better. From my own implementation effort, it took only a tiny amount of code to make this faster, but it took a large amount of effort to understand in detail what that code was supposed to be doing and how that interacts with a given hardware platform. I have depended on material written by other people, but to be *really* sure that you did it right, I think it would be necessary to contact hardware vendors and to spend much time poring over the standard. Compare-and-swap is much easier to do and certainly correct. – Bjarke H. Roune Feb 07 '13 at 12:53
  • See http://connect.microsoft.com/VisualStudio/feedback/details/770885/std-atomic-load-implementation-is-absurdly-slow – Stephan Tolksdorf Mar 23 '13 at 22:59

1 Answer


I do not believe that relaxed atomic loads require compare-and-swap. In the end this std::atomic implementation was not usable for my purpose, but I still wanted the interface, so I made my own std::atomic using MSVC's barrier intrinsics. It performs better than the default std::atomic for my use case. You can see the code here. It is intended to implement the C++11 semantics for all the memory orderings for load and store. By the way, GCC 4.6 is no better in this regard; I don't know about GCC 4.7.
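To illustrate the general idea (a minimal sketch under my own assumptions, not the answerer's linked code): on x86/x64 an aligned load of a lock-free type such as int or int* is already atomic at the hardware level, so a relaxed load can be a plain read, and acquire/seq_cst loads only need a compiler barrier so later accesses are not reordered before the load at compile time. The name loadAtomic and the choice of _ReadWriteBarrier are mine, for illustration only.

#include <atomic>
#include <intrin.h>   // MSVC intrinsics; _ReadWriteBarrier is a compiler-only barrier

// Sketch only: assumes an aligned, lock-free T (e.g. int or int*) on x86/x64,
// where a plain aligned load is already atomic in hardware.
template<class T>
T loadAtomic(const volatile T& location, std::memory_order order) {
  T value = location;                 // single aligned read, no lock cmpxchg
  switch (order) {
  case std::memory_order_relaxed:
    break;                            // no ordering guarantees needed
  case std::memory_order_consume:
  case std::memory_order_acquire:
  case std::memory_order_seq_cst:
    _ReadWriteBarrier();              // keep the compiler from moving later
    break;                            // accesses above the load; x86 loads
                                      // already have acquire ordering
  default:
    break;                            // release/acq_rel are not valid for loads
  }
  return value;
}

With this kind of mapping, a relaxed load should compile down to a single mov, which is the behavior the question was expecting in the first place.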
