
And how much faster/slower is it compared to an uncontested atomic-variable operation (such as C++'s std::atomic<T>)?

Also, how much slower are contested atomic variables relative to the uncontested lock?

The architecture I'm working on is x86-64.
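
For concreteness, a minimal sketch of the two operations being compared (the names here are purely illustrative):

```cpp
#include <atomic>
#include <mutex>

std::atomic<int> atomic_counter{0};
int plain_counter = 0;
std::mutex counter_mutex;

void atomic_increment() {
    // Uncontested atomic read-modify-write (sequentially consistent by default).
    atomic_counter.fetch_add(1);
}

void locked_increment() {
    // Uncontested lock/unlock pair around a plain increment.
    std::lock_guard<std::mutex> guard(counter_mutex);
    ++plain_counter;
}
```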

asked by pythonic, edited by Peter Cordes
  • @KonradRudolph, I see the questions are similar but not exactly the same. This one is more focused on fundamental costs of operations whereas the other is the overhead cost of two approaches to an algorithm. I would actually answer them somewhat differently. – edA-qa mort-ora-y Jun 13 '12 at 11:47
  • @edA-qamort-ora-y As the author of the other question I can state that they are the same. The other question may be *phrased* differently (in terms of overhead) but what it was actually asking is “How much faster than a lock is an atomic operation?” – Konrad Rudolph Jun 13 '12 at 11:56

3 Answers


I happen to have a lot of low-level speed tests lying around. However, what exactly "speed" means is very uncertain, because it depends a lot on what exactly you are doing (even things unrelated to the operation itself).

Here are some numbers from a 64-bit AMD Phenom II X6 at 3.2 GHz. I've also run this on Intel chips, and the times vary a lot (again, depending on exactly what is being done).

A GCC __sync_fetch_and_add, which would be a fully-fenced atomic addition, has an average of 16ns, with a minimum time of 4ns. The minimum time is probably closer to the truth (though even there I have a bit of overhead).
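
A minimal sketch of that operation (the counter is illustrative; the newer spellings appear only as comments):

```cpp
long counter = 0;

void fenced_add() {
    // Legacy GCC builtin: atomic add that implies a full memory barrier.
    __sync_fetch_and_add(&counter, 1);
    // Roughly equivalent modern forms:
    //   __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    //   fetch_add(1) on a std::atomic<long>.
}
```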

An uncontested pthread mutex (through boost) is 14ns (which is also its minimum). Note this is also a bit too low, since the time will actually increase if something else had locked the mutex, even though it is not locked now (since that causes a cache sync).

A failed try_lock is 9ns.
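
A sketch of the two mutex cases, using the pthread API directly rather than through boost (illustrative only):

```cpp
#include <pthread.h>
#include <errno.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void uncontested_lock_unlock() {
    pthread_mutex_lock(&m);    // the uncontested case measured above (~14ns)
    pthread_mutex_unlock(&m);
}

void failed_try_lock() {
    int rc = pthread_mutex_trylock(&m);
    if (rc == EBUSY) {
        // Another thread holds the mutex: this failed attempt is the ~9ns case.
    } else if (rc == 0) {
        pthread_mutex_unlock(&m);  // we actually acquired it, so release it again
    }
}
```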

I don't have a plain old atomic inc since on x86_64 this is just a normal exchange operation. Likely close to the minimum possible time, so 1-2ns.

Calling notify on a condition variable without a waiter is 25ns (about 304ns if something is waiting).
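
And a sketch of the condition-variable case (again illustrative):

```cpp
#include <pthread.h>

pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t cmtx = PTHREAD_MUTEX_INITIALIZER;  // the mutex a waiter would pair with pthread_cond_wait

void notify() {
    // Cheap (~25ns above) if no thread is blocked in pthread_cond_wait;
    // considerably more expensive (~304ns above) if it actually wakes a waiter.
    pthread_cond_signal(&cond);
}
```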

Since all locks impose certain CPU ordering guarantees, the amount of memory you have modified (whatever fits in the store buffer) will alter how long such operations take. And obviously if you ever have contention on a mutex, that is your worst case: any return to the Linux kernel can cost hundreds of nanoseconds even if no thread switch actually occurs. This is usually where atomic operations out-perform locks, since they never involve any kernel calls: your average-case performance is also your worst case. Mutex unlocking also incurs an overhead if there are waiting threads, whereas an atomic operation would not.


NOTE: Doing such measurements is fraught with problems, so the results are always somewhat questionable. My tests try to minimize variation by fixing the CPU speed, setting CPU affinity for threads, running no other processes, and averaging over large result sets.
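
A minimal sketch of that kind of harness on Linux (pinning the thread and averaging over many iterations; the specifics here are assumptions, not the actual test code):

```cpp
#include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (GNU extensions)
#include <chrono>
#include <cstdio>

static long counter = 0;

int main() {
    // Pin the measuring thread to one core so migration doesn't add noise.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    const long iterations = 10000000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        __sync_fetch_and_add(&counter, 1);   // the operation under test
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.2f ns per operation\n", ns / iterations);
}
```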

answered by edA-qa mort-ora-y, edited by Jonathan Wakely
  • Thanks for the numbers! Which platform did you test? saying "pthread mutex" doesn't say much, as what that means depends entirely on the implementation. As the time is close to an atomic add I'm assuming it's GNU/Linux, so using a futex? – Jonathan Wakely Jun 13 '12 at 18:24
  • Yes, on linux. Uncontested means it doesn't touch a system call though, thus the futex isn't actually involved in that case (non-contested in the NPTL library is resolved entirely in user-space with no system call). – edA-qa mort-ora-y Jun 13 '12 at 20:24
  • In my mind "the futex" _is_ the integer, so it's involved, but all that is needed is an atomic increment of "the futex" (i.e. the integer) – Jonathan Wakely Jun 13 '12 at 20:41
  • 1
    Atomic increment is not doable with `xchg` (even though that has an implicit `lock` prefix). `lock add [mem], 1` is almost exactly as expensive as `lock xadd [mem], eax` on most CPUs, only slightly simpler. It certainly won't be as fast as 1ns (3 clocks on a 3GHz CPU), the full barrier from the `lock` prefix doesn't block out-of-order execution of non-memory instructions. Agner Fog's instruction tables don't have `lock` numbers from K10, but Piledriver `lock add` is one per ~40 cycles (same as `xchg [mem],reg`) while `lock xadd` is one per ~39 cycles. – Peter Cordes May 03 '19 at 00:54

There’s a project on GitHub with the purpose of measuring this on different platforms. Unfortunately, after my master's thesis I never really had the time to follow up on this, but at least the rudimentary code is there.

It measures pthreads and OpenMP locks, compared to the __sync_fetch_and_add intrinsic.

From what I remember, we were expecting a pretty big difference between locks and atomic operations (~ an order of magnitude) but the real difference turned out to be very small.

However, measuring now on my system yields results which reflect my original guess, namely that (regardless of whether pthreads or OpenMP is used) atomic operations are about five times faster, and a single locked increment operation takes about 35ns (this includes acquiring the lock, performing the increment, and releasing the lock).
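
The variants being compared are roughly the following (a sketch under the stated assumptions, not the actual code from the repository):

```cpp
#include <omp.h>
#include <pthread.h>

long counter = 0;
pthread_mutex_t pmutex = PTHREAD_MUTEX_INITIALIZER;
omp_lock_t olock;  // call omp_init_lock(&olock) once before use

void pthread_locked_increment() {
    pthread_mutex_lock(&pmutex);    // acquire the lock
    ++counter;                      // perform the increment
    pthread_mutex_unlock(&pmutex);  // release the lock
}

void openmp_locked_increment() {
    omp_set_lock(&olock);
    ++counter;
    omp_unset_lock(&olock);
}

void atomic_increment() {
    __sync_fetch_and_add(&counter, 1);  // the intrinsic both lock variants are compared against
}
```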

answered by Konrad Rudolph
  • I think it can matter a *lot* whether you have high contention vs. low contention. Taking and releasing a lock, or x86 `lock add [mem], 1`, are both pretty fast if the cache line(s) (lock and data, or just the data for atomics) are still in MESI Modified or Exclusive state on the current core. But anyway, it's hard to microbenchmark because on some ISAs a weakly-ordered atomic increment (like std::memory_order_relaxed) avoids a memory barrier, the cost of which depends some on how many *other* loads/stores might be in flight and can't reorder. – Peter Cordes May 03 '19 at 00:45
  • IDK if your code on github has lots of threads doing nothing but hammering on the same variable trying to increment it, but that's usually not very realistic. If you had a real program that spent most of its time doing that, it would be a win to make it single-threaded. Anyway, lock-free RMW atomics are usually a bit faster than lock/unlock in the uncontended case (no function-call overhead, and a few less asm instructions), but can be *much* faster in the read-only case where readers never have to acquire a lock. – Peter Cordes May 03 '19 at 00:49

It depends on the lock implementation, and on the system too. Atomic variables can't really be contested in the same way as a lock (not even if you are using acquire-release semantics); that is the whole point of atomicity: it locks the bus to propagate the store (depending on the memory barrier mode), but that's an implementation detail.

However, most user-mode locks are just wrapped atomic ops. See this article by Intel for some figures on high-performance, scalable locks using atomic ops on x86 and x64 (compared against Windows' CriticalSection locks; unfortunately, no stats are to be found for the SWR locks, but one should always profile for one's own system/environment).
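
As an illustration of a user-mode lock being "wrapped atomic ops", here is a minimal test-and-set spinlock sketch (not the SWR locks from the Intel article):

```cpp
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Atomically set the flag; keep spinning while it was already set.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait; a production lock would back off or eventually block in the kernel
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```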

answered by Necrolis
  • "Atomic variables can't really be contested in the same way as a lock" -- if two threads (on different cores) hammer the same atomic variable, then that's contesting it, surely? It's then up to the architecture/implementation whether or not contesting actually slows things down. You could perhaps compare it with two threads on different cores hammering the same non-atomic variable, to get a feel for whether the atomic synchronization is in some sense taking any time. – Steve Jessop Jun 13 '12 at 09:54
  • @SteveJessop, definitely. Two cores using the same variable will cause excessive sync'ing of that variable. You're bound at this point by the latency/bandwidth of the cache bus. – edA-qa mort-ora-y Jun 13 '12 at 09:58
  • @SteveJessop: you could call it that, but, IMO, it's done in a different manner altogether, thus you can't really put it in the same category as spin-wait-retrying on an already acquired lock. – Necrolis Jun 13 '12 at 10:00
  • @edA-qamort-ora-y: and the issue is potentially confused on x86-alike architectures because of the coherent cache. So like you say, hammering the same location is a kind of contention even if it *isn't* an atomic variable. I'm not sure whether the questioner knows this, but I think it's a confounding factor if you set out to find out what "the cost" is of a contested atomic increment. You could compare it against atomic increments in a single thread, or against a contested non-atomic increment (aka a data race) and come up with very different ideas of what "atomic contention" costs. – Steve Jessop Jun 13 '12 at 10:01
  • @Necrolis: sure, the mechanism is completely different, but I think the questioner is right to call all such things "contention". If my code is delayed waiting for some other code to get out of the road, then we're contesting no matter what the mechanism :-) – Steve Jessop Jun 13 '12 at 10:01
  • @SteveJessop: If you look at it that way, yes. – Necrolis Jun 13 '12 at 10:11
  • @SteveJessop: all normal CPU architectures that run a single multi-threaded program have coherent caches using MESI (or a variant) that can only commit from the store buffer to L1d when the core has exclusive ownership of the line. (Exclusive / Modified state). x86 has a strongly-ordered memory model that puts more restrictions on the order stores can commit from the store buffer to L1d, but cache-line ping-pong is not at all specific to x86. You see it on ARM or PowerPC, too, even though PPC is very weakly ordered. – Peter Cordes May 03 '19 at 01:00
  • To program a non-coherent machine (e.g. something that's actually a cluster of multiple normal systems each with its own coherency domain), you usually use MPI for message passing. Or something similar to how graphics or OpenCL drivers pass data to GPUs, because GPU memory is not cache-coherent with the CPU in most systems with discrete GPUs or GPU-based compute cards. (Some iGPUs are coherent with CPU caches). – Peter Cordes May 03 '19 at 01:02