I happen to have a lot of low-level speed tests lying around. However, what exactly "speed" means is very uncertain, because it depends a lot on what exactly you are doing (even things unrelated to the operation itself).
Here are some numbers from a 64-bit AMD Phenom II X6 at 3.2 GHz. I've also run these on Intel chips and the times vary a lot (again, depending on exactly what is being done).
A GCC __sync_fetch_and_add, which is a fully fenced atomic addition, averages 16ns, with a minimum time of 4ns. The minimum time is probably closer to the truth (though even there I have a bit of overhead).
An uncontended pthread mutex (through boost) is 14ns (which is also its minimum). Note this is also a bit too low: the time will actually increase if another thread had locked the mutex earlier, even though it is uncontended now, since that forces a cache sync.
A failed try_lock is 9ns.
I don't have a time for a plain atomic inc, since on x86_64 this is just a normal exchange operation. It is likely close to the minimum possible time, so 1-2ns.
Calling notify on a condition variable without a waiter is 25ns (about 304ns if something is waiting).
As all locks impose certain CPU ordering guarantees, however, the amount of memory you have modified (whatever fits in the store buffer) will alter how long such operations take. And obviously, if you ever have contention on a mutex, that is your worst time: any return to the Linux kernel can cost hundreds of nanoseconds even if no thread switch actually occurs.

This is usually where atomic locks out-perform, since they never involve kernel calls: your average-case performance is also your worst case. Mutex unlocking also incurs an overhead if there are waiting threads, whereas an atomic would not.
NOTE: Doing such measurements is fraught with problems, so the results are always somewhat questionable. My tests try to minimize variation by fixing the CPU speed, setting CPU affinity for threads, running no other processes, and averaging over large result sets.