21

I want to put objects into a std::vector from multiple threads, so I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?

I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.

#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>   // for std::this_thread::yield()
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < static_cast<int>(size); counter++) {
            // Spin until we flip the flag from false to true ("acquire the lock").
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);  // "release the lock"
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < static_cast<int>(size); counter++) {
            // The lock is released automatically at the end of each iteration.
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }

    return 0;
}
edmz
Sergey Malashenko
    1. Post your compiler, compilation options & measuring results, please. 2. Do something observable with the resulting data after you measure, otherwise a good-enough optimiser can remove code as dead. – Angew is no longer proud of SO Apr 09 '15 at 08:46
  • In a 32-bit release build with Visual Studio 2013 I get 0, 46800 and 64-bit gives me 0, 62400 consistently so it would seem atomic is either super fast, or the test harness isn't really working. You should also know, in case you're using it, that in Visual Studio 2013 and below `high_resolution_clock` isn't any different than `system_clock`. http://stackoverflow.com/q/16299029/920069 – Retired Ninja Apr 09 '15 at 08:54
  • This code is badly broken regardless. Atomic operations with `memory_order_relaxed` are not synchronization operations. – T.C. Apr 09 '15 at 11:00
  • I updated my code. Now when I use four threads the first solution is faster than the second one (25-30%). But the first solution is slower than the second one if I increase the number of threads (20-25%). – Sergey Malashenko Apr 09 '15 at 11:42
  • Who cares. The code is still broken. What conclusions do you think you can draw? Broken code is faster? Broken code is slower? How is any of those useful? – R. Martinho Fernandes Apr 09 '15 at 11:45
  • Here https://gcc.gnu.org/ml/gcc-help/2013-10/msg00115.html I found that std::this_thread::yield() doesn't work properly, so that is the main problem in my code. – Sergey Malashenko Apr 12 '15 at 20:13
  • I'm curious why the vectors themselves aren't just made atomic, rather than using an atomic bool? – johnbakers May 05 '16 at 17:55

2 Answers

37

I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.

Mutexes are not slow. You may have seen articles that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach - these are not interchangeable.

Spin locks are considerably fast when they are held for a relatively short amount of time - acquiring them is very cheap, but the other threads that are also trying to lock remain active for that whole time (constantly running in a loop).

A custom spin-lock could be implemented this way:

#include <atomic>

class SpinLock
{
private:
    std::atomic_flag _lockFlag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        // Spin until the flag was previously clear, i.e. the lock was free.
        while (_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        // Succeeds only if the flag was clear before this call.
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        // Release ordering makes writes done under the lock visible
        // to the next thread that acquires it.
        _lockFlag.clear(std::memory_order_release);
    }
};
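
Since it provides lock(), try_lock() and unlock(), this class satisfies the standard Lockable requirements and works with the standard RAII guards. A minimal usage sketch (the shared_data and append names are just for illustration):

#include <mutex>
#include <vector>

std::vector<int> shared_data;
SpinLock spin;

void append(int value)
{
    // RAII guard: the spin lock is released automatically on scope exit,
    // even if push_back throws.
    std::lock_guard<SpinLock> guard(spin);
    shared_data.push_back(value);
}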

A mutex is a much more complicated primitive. In particular, on Windows we have two such primitives: the critical section, which works on a per-process basis, and the mutex, which doesn't have such a limitation.
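
For illustration, a minimal Win32 sketch of both primitives (the mutex name is made up; error handling omitted):

#include <windows.h>

int main()
{
    // Critical section: cheap, but usable only within a single process.
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);
    EnterCriticalSection(&cs);
    // ... protected work ...
    LeaveCriticalSection(&cs);
    DeleteCriticalSection(&cs);

    // Kernel mutex: heavier, but can be named and shared across processes.
    HANDLE hMutex = CreateMutexW(nullptr, FALSE, L"my_example_mutex");
    if (hMutex != nullptr)
    {
        WaitForSingleObject(hMutex, INFINITE);
        // ... protected work ...
        ReleaseMutex(hMutex);
        CloseHandle(hMutex);
    }
    return 0;
}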

Locking a mutex (or critical section) is much more expensive, but the OS can actually put the other waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.

Why do I write this? Because modern mutexes are often so-called "hybrid mutexes". When such a mutex is contended, it first behaves like a normal spin-lock - the other waiting threads perform some number of "spins" - and only then is the heavy mutex locked, to prevent wasting resources.
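
A minimal sketch of that spin-then-block idea (the HybridLock name and the spin count are made up; real hybrid mutexes live inside the OS or the standard library):

#include <mutex>

class HybridLock
{
private:
    std::mutex _mutex;

public:
    void lock()
    {
        // Spin phase: poll the lock a bounded number of times first.
        for (int i = 0; i < 100; ++i)
        {
            if (_mutex.try_lock())
                return;  // acquired cheaply, no blocking needed
        }
        // Fallback: block in the OS so the scheduler can run other threads.
        _mutex.lock();
    }

    void unlock()
    {
        _mutex.unlock();
    }
};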

In your case, the mutex is locked on each loop iteration only to perform this instruction:

second_result[counter] = counter;

It looks like a fast one, so the "real" mutex may never be locked. That means that in this case your "mutex" can be as fast as the atomic-based solution (because it effectively becomes an atomic-based solution itself).

Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure this behaviour is predictable in a multi-threaded environment. I am pretty sure that "locking" should have acquire semantics, while unlocking should be a release operation - relaxed memory ordering is too weak for this use case.


I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike std::atomic<> specializations) that is guaranteed to be lock-free - even std::atomic<bool> does not give you that.

Also, referring to the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:

void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}

source: boost/smart_ptr/detail/spinlock_std_atomic.hpp

Where detail::yield() is (Win32 version):

inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}

[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]

First, the thread spins a fixed number of times (4 in this case). If the lock is still held, the pause instruction is used (if available), or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform an actual (short) sleep. Very nice!
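
A portable sketch of the same escalation idea, using only the standard library (the class name and thresholds are made up, not Boost's actual values):

#include <atomic>
#include <chrono>
#include <thread>

class BackoffSpinLock
{
private:
    std::atomic_flag _lockFlag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        for (unsigned k = 0; _lockFlag.test_and_set(std::memory_order_acquire); ++k)
        {
            if (k < 16)
            { }  // pure busy-spin: cheapest if the lock is released soon
            else if (k < 64)
                std::this_thread::yield();  // let another ready thread run
            else
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    void unlock()
    {
        _lockFlag.clear(std::memory_order_release);
    }
};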

Also, this statement:

The purpose of a spinlock is busy waiting

is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement locking primitive - but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says this (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):

In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.

So, implementations like void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); } may not be as good as they seem.
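
For comparison, a minimal x86-only sketch that adds the pause hint to the spin loop (the class name is made up; _mm_pause() comes from <immintrin.h> and compiles to the pause instruction):

#include <atomic>
#include <immintrin.h>  // _mm_pause(), x86/x64 only

class PausingSpinLock
{
private:
    std::atomic_flag _lockFlag = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        while (_lockFlag.test_and_set(std::memory_order_acquire))
            _mm_pause();  // tell the CPU this is a spin-wait loop
    }

    void unlock()
    {
        _lockFlag.clear(std::memory_order_release);
    }
};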

Mateusz Grzejek
  • That's not a spinlock. The purpose of a spinlock is busy waiting and explicitly NOT yielding. – Kaiserludi May 03 '16 at 15:39
  • You should've used the preexisting `std::atomic_flag` class for this. [This is how a "proper" spin-lock should look](https://github.com/bit2shift/r3dVoxel/blob/master/inc/r3dVoxel/util/spin_lock.hpp). – bit2shift Jan 02 '17 at 17:59
  • @Kaiserludi This may or may *not* be true. I updated the answer to address your comment. Same for @bit2shift - *your* implementation of spinlock may not be "proper" in every case. For example, Boost uses a very nice custom yield strategy inside `lock()` to optimize the performance of its spinlock implementation. Regarding `std::atomic_flag` - I've updated the code. It is indeed the only type that is guaranteed to be lock-free, so it is a natural choice when writing a custom spinlock. – Mateusz Grzejek Jan 23 '18 at 10:25
  • @bit2shift Sorry, but this is how a spin-lock should not look. Spinning on an operation that requires the exclusive state of the cache line is very inefficient. It is, for example, discussed here: https://en.wikipedia.org/wiki/Spinlock#Significant_optimizations or here: https://rigtorp.se/spinlock/. – Daniel Langr May 13 '21 at 10:53
1

There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potential) modification of a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line with the lock's memory location to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that the operation will succeed.
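
A sketch of that test-and-test-and-set pattern (the class name is made up): the inner loop spins on a plain load, which keeps the cache line in the shared state, and the expensive exchange is attempted only when the lock looks free:

#include <atomic>

class TtasSpinLock
{
private:
    std::atomic<bool> _locked{false};

public:
    void lock()
    {
        for (;;)
        {
            // Attempt the expensive read-modify-write only when promising.
            if (!_locked.exchange(true, std::memory_order_acquire))
                return;
            // Otherwise spin on plain loads: no cache-line invalidation traffic.
            while (_locked.load(std::memory_order_relaxed))
            { }
        }
    }

    void unlock()
    {
        _locked.store(false, std::memory_order_release);
    }
};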

A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++

Daniel Langr