Spinlocks and mutexes are both MUCH more performant than you might think, if the measurements are done correctly.
My code below shows 75 ns for a mutex and 12.5 ns for a spinlock on my old 2 GHz Windows laptop, and on the modern 3 GHz GodBolt Linux servers it shows on average 15 ns for a mutex and 8 ns for a spinlock. See more details after the code, where the console output of the measurements is located.
The GodBolt Linux mutex is much faster because on Linux it is implemented in a much more performant way than on Windows. On Windows you may want to try std::shared_mutex instead of std::mutex: shared_mutex appeared in a much later version of Windows and is built on a more performant algorithm and API. The shared version takes about 50 ns on my laptop, while the regular mutex takes 75 ns.
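A minimal sketch of how the std::shared_mutex variant can be measured (a stripped-down version of the mutex benchmark below, not the exact code behind the 50 ns number; only nanoseconds are measured here, and std::lock_guard takes the exclusive lock):

#include <shared_mutex>
#include <mutex>
#include <future>
#include <chrono>
#include <cstdint>
#include <algorithm>
#include <iostream>

int main() {
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    size_t constexpr nloops = 1 << 4, ntests = 1 << 13;
    uint64_t min_time = ~0ULL;
    std::shared_mutex mux;
    // Dummy thread (see the explanation below): it locks once and exits,
    // only so the lock is visibly used from a second thread.
    auto volatile ares = std::async(std::launch::async,
        [&]{ std::lock_guard<std::shared_mutex> lock(mux); });
    for (size_t j = 0; j < nloops; ++j) {
        auto tim = Time();
        for (size_t i = 0; i < ntests; ++i) {
            std::lock_guard<std::shared_mutex> lock(mux); // exclusive lock/unlock
        }
        min_time = std::min<uint64_t>(min_time, Time() - tim);
    }
    std::cout << "shared_mutex time: " << double(min_time) / ntests << " ns" << std::endl;
}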
I also measure cycles and GHz, which are shown in the console output. Note that since cycles are measured with the RDTSC instruction, the cycle count corresponds to the base frequency. This means that if the CPU is currently thermally throttled or turbo-boosted, i.e. has moved away from its base speed, the cycle count will be shown incorrectly. Only the number of nanoseconds (ns) is always shown correctly, corresponding to the current CPU speed.
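A quick way to see what the "GHz" column in the output actually means is to compute the cycles/ns ratio over a fixed wall-clock interval; a tiny sketch (the full programs below print this same ratio):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <immintrin.h>

int main() {
    auto t0 = std::chrono::high_resolution_clock::now();
    uint64_t c0 = __rdtsc();
    // Busy-wait ~100 microseconds so both counters advance by a measurable amount.
    while (std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::high_resolution_clock::now() - t0).count() < 100) {}
    uint64_t cycles = __rdtsc() - c0;
    uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - t0).count();
    // This prints the rate at which RDTSC ticks (the invariant TSC rate, roughly the
    // base frequency). Per-operation cycle counts in the benchmarks are expressed in
    // these ticks, so they match real core cycles only while the core runs at this speed.
    std::cout << double(cycles) / ns << " GHz (RDTSC rate)" << std::endl;
}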
How to measure everything correctly here: first, to measure ONLY the time of the spinlock and mutex themselves, without any extra work, you need to measure a single thread.
But to measure only a single thread, you have to show the compiler that you are going to use the mutex and spinlock from other threads as well. For that I start a second, dummy thread through std::async and keep its future in a volatile variable. This dummy thread prevents the compiler from optimizing the mutex and spinlock code away, because the compiler sees the lock being used from another thread concurrently. Without the dummy thread, the mutex and spinlock code might be removed from the single (main) thread entirely.
The second dummy thread actually does nothing: it locks the mutex and the spinlock just once and quits, so it does not disturb the measurements.
One more important thing concerns std::atomic_flag: you have to use a more relaxed memory order. Specifically, in my code you can see that I use std::memory_order_acquire on the locking line of the spinlock and std::memory_order_release on the release line. This weaker memory order squeezes more speed out of the spinlock by removing unnecessary fencing. By default the much slower sequentially consistent order std::memory_order_seq_cst is used.
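Wrapped into a reusable class, such a spinlock is only a few lines; a sketch (the class name SpinLock is just for illustration, the acquire/release orders are the point; acquire orders the critical section after a successful test_and_set, release publishes its writes to the next thread that acquires the lock):

#include <atomic>
#include <mutex> // std::lock_guard

class SpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until the flag was previously clear; acquire is enough here.
        while (f.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() {
        // Release makes the critical section's writes visible to the next owner.
        f.clear(std::memory_order_release);
    }
};

// Usage: works with std::lock_guard just like std::mutex.
// SpinLock sl;
// { std::lock_guard<SpinLock> guard(sl); /* critical section */ }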
The next very important point, probably the main reason for the wrong results in your code: you should take MANY measurements and keep the smallest one. Why the smallest? Because if you measure for a whole second, the operating system (Windows/Linux/macOS) will re-schedule all threads many times during that period, roughly once every 10-15 milliseconds. Each re-schedule causes a huge pause between thread wake-ups and spoils the results.
To avoid this spoiling of timings when threads are re-scheduled, we do two things: first, we measure only a small interval of time, around 50-100 microseconds; second, we repeat the measurement many times (10-15 times). On any operating system we are then fairly well guaranteed that at least one 50-microsecond measurement is clean, i.e. does not include a thread re-schedule.
100 microseconds is also quite enough for the precision of std::chrono::high_resolution_clock, which usually has a resolution of about 200-500 nanoseconds.
Taking the minimum of 15 measurements also helps to filter out other hardware delays: the minimum shows the most performant result, and approximately this result is what will usually happen on average later in real systems.
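The measurement pattern, stripped of everything else, is just this (a sketch with placeholder work; the full programs below use the same keep-the-minimum loop and additionally record RDTSC cycles):

#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    auto Time = []() -> uint64_t {
        static auto const start = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - start).count();
    };
    size_t constexpr nloops = 1 << 4;  // ~15 repetitions
    size_t constexpr ntests = 1 << 13; // enough iterations for roughly 50-100 microseconds
    uint64_t min_time = ~0ULL;
    volatile uint64_t sink = 0;        // placeholder "work" that cannot be optimized away
    for (size_t j = 0; j < nloops; ++j) {
        auto tim = Time();
        for (size_t i = 0; i < ntests; ++i)
            sink = sink + 1;           // replace with lock()/unlock() of the thing under test
        tim = Time() - tim;
        if (tim < min_time)            // keep only the cleanest (smallest) run
            min_time = tim;
    }
    std::cout << double(min_time) / ntests << " ns per operation" << std::endl;
}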
I also use RDTSC to measure cycles, but only so that the console shows, besides nanoseconds (ns), the cycle count and the GHz of the CPU.
A tiny further speed improvement comes from repeating the lock/unlock of both the mutex and the spinlock twice inside the main measurement loop. This is a small manual loop unrolling, to be even more precise and to spend fewer instructions on loop-variable increments and comparisons.
Try it online on GodBolt!
#include <cstdint>
#include <atomic>
#include <chrono>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>
#include <immintrin.h>

int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        // Nanoseconds since the first call.
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    size_t constexpr nloops = 1 << 4, ntests = 1 << 13;
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        // Dummy thread: locks/unlocks once and exits, only so the compiler sees
        // the flag being used from another thread and cannot optimize it away.
        auto volatile ares = std::async(std::launch::async, [&]{ while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            // Two lock/unlock pairs per iteration: small manual loop unrolling.
            for (size_t i = 0; i < ntests / 2; ++i) {
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            // Keep the smallest (cleanest) of the nloops measurements.
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Spinlock time: " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        // Same dummy-thread trick for the mutex.
        auto volatile ares = std::async(std::launch::async, [&]{ std::lock_guard<std::mutex> lock(mux); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            for (size_t i = 0; i < ntests / 2; ++i) {
                { std::lock_guard<std::mutex> lock(mux); }
                { std::lock_guard<std::mutex> lock(mux); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Mutex time : " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
}
Output of a modern 3 GHz GodBolt Linux online server:
Spinlock time: 7.65 ns (22.95 cycles, 3.00 GHz)
Mutex time : 15.59 ns (46.78 cycles, 3.00 GHz)
Output of the old 2 GHz Windows laptop:
Spinlock time: 12.38 ns (25.94 cycles, 2.10 GHz)
Mutex time : 74.27 ns (155.58 cycles, 2.09 GHz)
(std::shared_mutex time for the PC above is 50 ns)
One can make the program more complex by running multiple threads, to measure how the timings behave when several threads race to acquire the same atomic flag or mutex.
The following program tests both single-threaded and multi-threaded timings. It also increments a counter as a kind of spinlock/mutex-protected work.
To get precise results the program sets thread affinity, assigning different threads to different cores.
The multi-threaded version uses std::barrier to start and stop the measurements at exactly the same points in time. It also uses atomic counters to synchronize the pace of the threads so that they advance at the same rate.
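The start/stop synchronization on its own looks roughly like this (a sketch with placeholder work instead of the lock, and without the pace-syncing atomic counters, which are only in the full program):

#include <barrier>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    size_t constexpr nthr = 2, ntests = 1 << 20;
    std::barrier barrier(nthr);
    std::vector<uint64_t> timings(nthr);
    std::vector<std::thread> threads;
    for (size_t ithr = 0; ithr < nthr; ++ithr)
        threads.emplace_back([&, ithr]{
            barrier.arrive_and_wait(); // all threads start measuring at the same moment
            auto start = std::chrono::high_resolution_clock::now();
            volatile uint64_t sink = 0; // placeholder for the lock/unlock work under test
            for (size_t i = 0; i < ntests; ++i)
                sink = sink + 1;
            timings[ithr] = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::high_resolution_clock::now() - start).count();
            barrier.arrive_and_wait(); // all threads stop before results are read
        });
    for (auto & t : threads)
        t.join();
    for (size_t ithr = 0; ithr < nthr; ++ithr)
        std::cout << "thread " << ithr << ": " << double(timings[ithr]) / ntests << " ns/op" << std::endl;
}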
The multi-threaded timings should be almost equal to the single-threaded ones, especially on modern CPUs. They can be MUCH bigger only if the cores chosen via affinity are heavily occupied by the operating system, or if the operating system re-schedules the threads in some strange order. It happened to me that the Windows multi-threaded timings were large while the Linux timings were perfectly fine.
See the timings after the code. Note that the GodBolt servers' timings are very small. The spinlock (both single- and multi-threaded) takes 8 ns (23 cycles), which corresponds to the timing of the atomic flag's main instruction, XCHG (17 cycles here). The mutex takes 16 ns (50 cycles), almost the same as the spinlock, because on Linux it is implemented on top of an atomic flag; the extra overhead comes from the syscall operation or the CALL instruction.
Try it online!
#include <cstdint>
#include <atomic>
#include <algorithm>
#include <chrono>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>
#include <barrier>
#include <optional>
#include <thread>
#include <vector>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define FINL [[msvc::forceinline]]
#else
#define FINL __attribute__((always_inline))
#endif
#ifdef _WIN32
#include <windows.h>
// https://stackoverflow.com/a/56486809/941531
inline void SetAffinity(std::thread & thr, size_t i) {
    if (SetThreadAffinityMask(thr.native_handle(), DWORD_PTR(1) << i) == 0)
        std::cout << "SetThreadAffinityMask failed, GLE = " << GetLastError() << std::endl;
}
#else
#include <pthread.h>
// https://stackoverflow.com/a/57620568/941531
inline void SetAffinity(std::thread & thr, size_t i) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(i, &cpuset);
    int rc = pthread_setaffinity_np(thr.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
        std::cout << "Error calling pthread_setaffinity_np: " << rc << std::endl;
}
#endif
int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    // Single-threaded run: hammers ReLock() and keeps the smallest timing.
    auto RunST = [&Rdtsc, &Time](auto const & ReLock, auto nloops, auto ntests, auto & min_time, auto & min_rdtsc){
        { auto volatile ares = std::async(std::launch::async, [&]{ ReLock(); }); }
        for (size_t j = 0; j < nloops; ++j) {
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 4; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };
    // Multi-threaded run: the barrier makes all threads start and stop together,
    // ReSync (atomic counters) keeps the threads advancing at the same pace.
    auto RunMT = [&Rdtsc, &Time](auto nthr, auto ithr, auto const & ReLock, auto nloops, auto ntests, auto & barrier, auto & timings, auto & min_time, auto & min_rdtsc, auto & brs){
        for (size_t j = 0; j < nloops; ++j) {
            barrier.arrive_and_wait();
            size_t nibr = (ithr + 1) % nthr;
            uint32_t ibrc = 0;
            auto ReSync = [&]{
                ibrc += 16;
                brs[ithr].store(ibrc, std::memory_order_relaxed);
                while (ibrc > brs[nibr].load(std::memory_order_relaxed)) {}
            };
            ReSync();
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 16; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReSync();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            timings[ithr] = {tim, timr};
            barrier.arrive_and_wait();
            if (ithr != 0)
                continue;
            // Thread 0 aggregates; discard the loop if per-thread timings diverge too much.
            std::sort(timings.begin(), timings.end());
            if (timings.back().first >= timings.front().first * 1.3)
                continue;
            tim = 0; timr = 0;
            for (size_t i = 0; i < timings.size(); ++i) {
                tim += timings[i].first;
                timr += timings[i].second;
            }
            tim /= timings.size();
            timr /= timings.size();
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };
    auto const hard_threads = std::thread::hardware_concurrency();
    auto const test_threads = 2; //std::max<size_t>(2, hard_threads / 2);
    std::cout << test_threads << " threads." << std::endl;
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                // Pin each thread to its own core.
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Mutex time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Mutex time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
}
Output on GodBolt Linux servers:
2 threads.
Single-Threaded Spinlock time: 8.63 ns ( 25.91 cycles, 3.00 GHz), test time 0.31 sec, precision 1.00000000
Multi -Threaded Spinlock time: 8.51 ns ( 25.58 cycles, 3.00 GHz), test time 1.04 sec, precision 1.00000000
Single-Threaded Mutex time: 15.89 ns ( 47.66 cycles, 3.00 GHz), test time 0.82 sec, precision 1.00000000
Multi -Threaded Mutex time: 17.94 ns ( 53.86 cycles, 3.00 GHz), test time 4.18 sec, precision 1.00000000
Output on the old Windows laptop:
2 threads.
Single-Threaded Spinlock time: 31.40 ns ( 67.22 cycles, 2.14 GHz), test time 0.48 sec, precision 1.00000000
Multi -Threaded Spinlock time: 32.45 ns ( 70.21 cycles, 2.16 GHz), test time 0.98 sec, precision 1.00000000
Single-Threaded Mutex time: 79.69 ns (168.36 cycles, 2.11 GHz), test time 0.94 sec, precision 1.00000000
Multi -Threaded Mutex time: 83.67 ns (269.23 cycles, 3.22 GHz), test time 0.91 sec, precision 1.00000000