2

I wrote something using atomics rather than locks and, perplexed that it was so much slower in my case, I wrote the following mini test:

#include <pthread.h>
#include <vector>

struct test
{
    test(size_t size) : index_(0), size_(size), vec2_(size)
        {
            vec_.reserve(size_);
            pthread_mutexattr_init(&attrs_);
            pthread_mutexattr_setpshared(&attrs_, PTHREAD_PROCESS_PRIVATE);
            // Adaptive mutex: spins briefly in user space before sleeping when contended.
            pthread_mutexattr_settype(&attrs_, PTHREAD_MUTEX_ADAPTIVE_NP);

            pthread_mutex_init(&lock_, &attrs_);
        }

    void lockedPush(int i);
    void atomicPush(int* i);

    size_t              index_;
    size_t              size_;
    std::vector<int>    vec_;
    std::vector<int>    vec2_;
    pthread_mutexattr_t attrs_;
    pthread_mutex_t     lock_;
};

void test::lockedPush(int i)
{
    pthread_mutex_lock(&lock_);
    vec_.push_back(i);
    pthread_mutex_unlock(&lock_);
}

void test::atomicPush(int* i)
{
    // The offset computation is deliberately redundant; it just makes the
    // function look like the code in the larger application (see below).
    int ii       = (int) (i - &vec2_.front());
    size_t index = __sync_fetch_and_add(&index_, 1);
    vec2_[index & (size_ - 1)] = ii;
}

int main(int argc, char** argv)
{
    const size_t N = 1048576;
    test t(N);

//     for (int i = 0; i < N; ++i)
//         t.lockedPush(i);

    for (int i = 0; i < N; ++i)
        t.atomicPush(&i);
}

If I uncomment the lockedPush loop (and comment out the atomicPush loop) and run the test under time(1), I get output like so:

real    0m0.027s
user    0m0.022s
sys     0m0.005s

and if I instead run the loop calling atomicPush as shown above (the seemingly unnecessary pointer arithmetic is there because I want my function to look as much as possible like what my bigger code does), I get output like so:

real    0m0.046s
user    0m0.043s
sys     0m0.003s

I'm not sure why this is happening as I would have expected the atomic to be faster than the lock in this case...

When I compile with -O3 I see timings for the lock and atomic versions as follows:

lock:
    real    0m0.024s
    user    0m0.022s
    sys     0m0.001s

atomic:    
    real    0m0.013s
    user    0m0.011s
    sys     0m0.002s

In my larger app, though, the lock (under single-threaded testing) still performs better.

Palace Chan
  • Not sure what do you mean in the timing, in my test the lockedPush is consistently slower than atomicPush by ~70%. – kennytm Sep 19 '12 at 16:03
  • Which is what I'd like to see! Do you have an SMP kernel? I read somewhere the attribute PTHREAD_MUTEX_ADAPTIVE_NP will make it spin which is super fast when not contended. Some system info of mine is: 2.6.32-220.13.1.el6.x86_64 #1 SMP Thu Mar 29 11:46:40 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux – Palace Chan Sep 19 '12 at 16:08
  • 3.5.4-1-ARCH #1 SMP PREEMPT Sat Sep 15 08:12:04 CEST 2012 x86_64 GNU/Linux. – kennytm Sep 19 '12 at 16:09
  • Ah i'm confused then how you can be getting such results. If anything I'm assuming you're using a better compiler but then according to the below ("the memory barrier prevents compiler optimizations") I would also expect you to see my timings. – Palace Chan Sep 19 '12 at 16:15
  • How are you compiling the program? I used `g++-4.7 -O3 -pthread`. – kennytm Sep 19 '12 at 16:17
  • Try to add a cacheline of padding between index_ and vec2 and try again, maybe this is a cacheline artifact. – Christopher Sep 19 '12 at 16:18
  • @KennyTM under -O3 I am seeing better timings for the atomicPush..interesting. – Palace Chan Sep 19 '12 at 16:24
  • @Christopher oh what do you mean? Not sure we can get both in the same cache line, vector is large. – Palace Chan Sep 19 '12 at 16:25
  • @Christopher: The test is actually single threaded, so there should be no false sharing, if that is why you are suggesting the padding – David Rodríguez - dribeas Sep 19 '12 at 16:29
  • @PalaceChan No, I mean insert a byte[64] array after index_ to force size_ and vec2 to be in another cache line. But I don't really know if this really is a aliasing/false sharing artifact. – Christopher Sep 19 '12 at 16:30
  • The test should be identical under the two circumstances except for the issue under test, which is lock versus atomic. Currently you are testing the cost of Lock + push_back versus atomic + vec[]. – Brian Sep 19 '12 at 17:07
  • Well for one thing, you are not testing the same function. I agree with @brian about that. Testing with only one thread might not be a good test. – johnnycrash Jul 15 '14 at 21:01

2 Answers

6

An uncontended mutex is extremely fast to lock and unlock. With an atomic variable, you're always paying a certain memory synchronisation penalty (especially since you're not even using relaxed ordering).
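As a rough illustration of what relaxed ordering means, here is a minimal sketch of a relaxed versus a sequentially consistent increment, assuming C++11 <atomic> is available (the question uses the GCC __sync builtin instead):

#include <atomic>
#include <cstddef>

std::atomic<size_t> counter(0);

size_t bump_seq_cst()
{
    // Default std::atomic ordering is sequentially consistent, comparable to
    // the full barrier implied by __sync_fetch_and_add.
    return counter.fetch_add(1);
}

size_t bump_relaxed()
{
    // Relaxed ordering: the read-modify-write is still atomic, but it imposes
    // no ordering on surrounding loads and stores, so the compiler (and CPU)
    // keep more freedom to reorder around it.
    return counter.fetch_add(1, std::memory_order_relaxed);
}

On x86 both forms compile to a lock'd xadd; the practical difference is that the relaxed version does not pin down the surrounding memory operations at the compiler level.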

Your test case is simply too naive to be useful. You have to test a heavily contended data access scenario.

Generally, atomics are slow (they get in the way of clever internal reordering, pipelining, and caching), but they allow for lock-free code which ensures that the entire program can make some progress. By contrast, if you get swapped out while holding a lock, everyone has to wait.
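To make the comparison meaningful, a contended version of the test might look roughly like this (a sketch only: it reuses the `test` struct from the question, and the thread count and iteration count are arbitrary):

#include <pthread.h>
#include <cstddef>

// Hypothetical contended driver; `test` is the struct from the question.
const int    NUM_THREADS = 4;        // arbitrary
const size_t ITERS       = 1048576;  // matches N in the question

void* lockedWorker(void* p)
{
    test* t = static_cast<test*>(p);
    for (size_t i = 0; i < ITERS; ++i)
        t->lockedPush((int) i);                      // all threads contend on the mutex
    return 0;
}

void* atomicWorker(void* p)
{
    test* t = static_cast<test*>(p);
    for (size_t i = 0; i < ITERS; ++i)
        t->atomicPush(&t->vec2_[i & (ITERS - 1)]);   // all threads contend on index_
    return 0;
}

int main()
{
    test t(ITERS);
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], 0, atomicWorker, &t);  // or lockedWorker
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], 0);
}

With several threads hammering the same mutex or counter, the relative costs of the two approaches look very different from the single-threaded numbers above.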

Kerrek SB
  • But doesn't an uncontended mutex have to satisfy memory visibility guarantees, too? – Pete Becker Sep 19 '12 at 16:07
  • Shouldn't a mutex be built on atomic primitives? – Brian Sep 19 '12 at 16:07
  • @PeteBecker: Sure, but for starters, those will have much more restrictive ordering requirements. For example, on x86 loads acquire and stores release in any case, so the mutex probably doesn't even need a fence... – Kerrek SB Sep 19 '12 at 16:09
  • 1
    @Brian: A mutex is more than just the locking primitive. It also contains kernel magic to allow *waiting* threads to sleep, and to wake up waiting threads upon unlock. Yes, the lock state has to be accessed atomically, but there's a lot more to a mutex than that. (The simplest lock that *only* uses an atomic flag is a *spin lock*.) – Kerrek SB Sep 19 '12 at 16:09
  • @KerrekSB - but that's also true for an atomic; on x86 it only needs a compiler barrier, not a memory barrier. Or am I thoroughly confused? – Pete Becker Sep 19 '12 at 16:11
  • @PeteBecker: If you demand sequential ordering (rather than acquire/release), you need a full fence. – Kerrek SB Sep 19 '12 at 16:12
  • the PTHREAD_MUTEX_ADAPTIVE_NP and the speed make me think it's spinning..but that's an atomic flag still right? And, I wonder how I could relax my barrier, multiple threads would be incrementing index_ which is why I put a __sync_fetch_and_add – Palace Chan Sep 19 '12 at 16:13
  • @PalaceChan: The pthread mutex is very magic. It spins for a bit, adaptively so, and only *then* goes on to do locking, futex style. – Kerrek SB Sep 19 '12 at 16:15
  • @KerrekSB and in this case the spin is a NOOP because of the lack of contention? – Palace Chan Sep 19 '12 at 16:17
  • Any lock cannot be a NOP, unless at compile-time it can be asserted that nothing else will ever synchronize with it. Thus at the minimum, the uncontended mutex here should be performing at least one atomic instruction in order to "know" that it is uncontended and thus acquired. Taking this back a level, there is also a timing issue in the original question in that the two sets of operations are not identical and the "expensive" part may be the push and not the lock. – Brian Sep 19 '12 at 17:02
  • @PalaceChan: it's not a no-op, but it's also not a system call (unlike in the contended case), and it's not a full memory barrier (like in your code). Again, this goes to show that you need a more real-life test case to appreciate the differences. – Kerrek SB Sep 19 '12 at 17:14
  • The mutex has got to be using atomic ops for synchronizing at the lowest levels. In addition, the mutex is tracking the locking thread etc etc. No way the mutex is faster. Just to make sure, I tested atomics vs mutex a few years ago on linux and atomics were 40x faster in my test. Of course I don't have the code any more to publish, but it was basically locking to increment a value vs compare and swap in a retry loop. – johnnycrash Jul 15 '14 at 21:05
1

Just to add to the first answer: when you do a __sync_fetch_and_add you actually enforce a specific code ordering. From the documentation:

A full memory barrier is created when this function is invoked

A memory barrier is something that causes

a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction

Chances are that even though your update is atomic, you are losing compiler optimizations by forcing the ordering of instructions.
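If the index itself does not need to order any other memory operations, a sketch of the same increment without the full barrier, using the newer __atomic builtins (GCC 4.7+, documented at the link in the comment below), would be:

#include <cstddef>

size_t index_ = 0;

size_t add_full_barrier()
{
    // Legacy __sync builtin: documented as a full memory barrier, so the
    // compiler cannot move loads/stores across this call.
    return __sync_fetch_and_add(&index_, 1);
}

size_t add_relaxed()
{
    // __atomic builtin with relaxed ordering: the increment is still atomic,
    // but no ordering constraint is imposed on surrounding operations.
    return __atomic_fetch_add(&index_, 1, __ATOMIC_RELAXED);
}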

devshorts
  • See also http://gcc.gnu.org/onlinedocs/gcc-4.7.1/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins – Hasturkun Sep 19 '12 at 16:21