Spinlocks and mutexes are both MUCH more performant than you might think, if the measurements are done correctly.
My code below shows 75 ns for a mutex and 12.5 ns for a spinlock on my old 2 GHz Windows laptop, and on the modern 3 GHz GodBolt Linux servers it shows on average 15 ns for a mutex and 8 ns for a spinlock. See more details after the code, where the console output of the measurements is located.
The GodBolt Linux mutex is much faster because on Linux it is implemented in a much more performant way than on Windows. On Windows you may want to try std::shared_mutex instead of std::mutex: shared_mutex appeared in a much later version of Windows and is built on a more performant algorithm and API. The shared version takes about 50 ns on my laptop, while the regular mutex takes 75 ns.
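A minimal sketch of how the std::shared_mutex variant can be measured (a stripped-down version of the mutex benchmark below, not the exact code behind the 50 ns number; only nanoseconds are measured here, and std::lock_guard takes the exclusive lock):

#include <shared_mutex>
#include <mutex>
#include <future>
#include <chrono>
#include <cstdint>
#include <algorithm>
#include <iostream>

int main() {
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    size_t constexpr nloops = 1 << 4, ntests = 1 << 13;
    uint64_t min_time = ~0ULL;
    std::shared_mutex mux;
    // Dummy thread (see the explanation below): it locks once and exits,
    // only so the lock is visibly used from a second thread.
    auto volatile ares = std::async(std::launch::async,
        [&]{ std::lock_guard<std::shared_mutex> lock(mux); });
    for (size_t j = 0; j < nloops; ++j) {
        auto tim = Time();
        for (size_t i = 0; i < ntests; ++i) {
            std::lock_guard<std::shared_mutex> lock(mux); // exclusive lock/unlock
        }
        min_time = std::min<uint64_t>(min_time, Time() - tim);
    }
    std::cout << "shared_mutex time: " << double(min_time) / ntests << " ns" << std::endl;
}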
I also measure cycles and GHz, which are shown in the console output. Note that since cycles are measured with the RDTSC instruction, the cycle count corresponds to the base frequency. This means that if the CPU is currently thermally throttled or turbo-boosted, i.e. has moved away from its base speed, the cycle count will be shown incorrectly. Only the number of nanoseconds (ns) is always shown correctly, corresponding to the current CPU speed.
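A quick way to see what the "GHz" column in the output actually means is to compute the cycles/ns ratio over a fixed wall-clock interval; a tiny sketch (the full programs below print this same ratio):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <immintrin.h>

int main() {
    auto t0 = std::chrono::high_resolution_clock::now();
    uint64_t c0 = __rdtsc();
    // Busy-wait ~100 microseconds so both counters advance by a measurable amount.
    while (std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::high_resolution_clock::now() - t0).count() < 100) {}
    uint64_t cycles = __rdtsc() - c0;
    uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - t0).count();
    // This prints the rate at which RDTSC ticks (the invariant TSC rate, roughly the
    // base frequency). Per-operation cycle counts in the benchmarks are expressed in
    // these ticks, so they match real core cycles only while the core runs at this speed.
    std::cout << double(cycles) / ns << " GHz (RDTSC rate)" << std::endl;
}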
How to measure everything correctly here: first, to measure ONLY the time of the spinlock and mutex themselves, without any extra work, you need to measure a single thread.
But to measure only a single thread, you have to show the compiler that you are going to use the mutex and spinlock from other threads as well. For that I start a second, dummy thread through std::async and keep its future in a volatile variable. This dummy thread prevents the compiler from optimizing the mutex and spinlock code away, because the compiler sees the lock being used from another thread concurrently. Without the dummy thread, the mutex and spinlock code might be removed from the single (main) thread entirely.
The second dummy thread actually does nothing: it locks the mutex and the spinlock just once and quits, so it does not disturb the measurements.
One more important thing concerns std::atomic_flag: you have to use a more relaxed memory order. Specifically, in my code you can see that I use std::memory_order_acquire on the locking line of the spinlock and std::memory_order_release on the release line. This weaker memory order squeezes more speed out of the spinlock by removing unnecessary fencing. By default the much slower sequentially consistent order std::memory_order_seq_cst is used.
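Wrapped into a reusable class, such a spinlock is only a few lines; a sketch (the class name SpinLock is just for illustration, the acquire/release orders are the point; acquire orders the critical section after a successful test_and_set, release publishes its writes to the next thread that acquires the lock):

#include <atomic>
#include <mutex> // std::lock_guard

class SpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until the flag was previously clear; acquire is enough here.
        while (f.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() {
        // Release makes the critical section's writes visible to the next owner.
        f.clear(std::memory_order_release);
    }
};

// Usage: works with std::lock_guard just like std::mutex.
// SpinLock sl;
// { std::lock_guard<SpinLock> guard(sl); /* critical section */ }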
The next very important point, probably the main reason for the wrong results in your code: you should take MANY measurements and keep the smallest one. Why the smallest? Because if you measure for a whole second, the operating system (Windows/Linux/macOS) will re-schedule all threads many times during that period, roughly once every 10-15 milliseconds. Each re-schedule causes a huge pause between thread wake-ups and spoils the results.
To avoid this spoiling of timings when threads are re-scheduled, we do two things: first, we measure only a small interval of time, around 50-100 microseconds; second, we repeat the measurement many times (10-15 times). On any operating system we are then fairly well guaranteed that at least one 50-microsecond measurement is clean, i.e. does not include a thread re-schedule.
100 microseconds is also quite enough for the precision of std::chrono::high_resolution_clock, which usually has a resolution of about 200-500 nanoseconds.
Taking the minimum of 15 measurements also helps to filter out other hardware delays: the minimum shows the most performant result, and approximately this result is what will usually happen on average later in real systems.
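The measurement pattern, stripped of everything else, is just this (a sketch with placeholder work; the full programs below use the same keep-the-minimum loop and additionally record RDTSC cycles):

#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    auto Time = []() -> uint64_t {
        static auto const start = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - start).count();
    };
    size_t constexpr nloops = 1 << 4;  // ~15 repetitions
    size_t constexpr ntests = 1 << 13; // enough iterations for roughly 50-100 microseconds
    uint64_t min_time = ~0ULL;
    volatile uint64_t sink = 0;        // placeholder "work" that cannot be optimized away
    for (size_t j = 0; j < nloops; ++j) {
        auto tim = Time();
        for (size_t i = 0; i < ntests; ++i)
            sink = sink + 1;           // replace with lock()/unlock() of the thing under test
        tim = Time() - tim;
        if (tim < min_time)            // keep only the cleanest (smallest) run
            min_time = tim;
    }
    std::cout << double(min_time) / ntests << " ns per operation" << std::endl;
}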
I also use RDTSC to measure cycles, but only so that the console shows, besides nanoseconds (ns), the cycle count and the GHz of the CPU.
A tiny further speed improvement comes from repeating the lock/unlock of both the mutex and the spinlock twice inside the main measurement loop. This is a small manual loop unrolling, to be even more precise and to spend fewer instructions on loop-variable increments and comparisons.
Try it online on GodBolt!
#include <cstdint>
#include <atomic>
#include <chrono>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>
#include <immintrin.h>

int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        // Nanoseconds since the first call.
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    size_t constexpr nloops = 1 << 4, ntests = 1 << 13;
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        // Dummy thread: locks/unlocks once and exits, only so the compiler sees
        // the flag being used from another thread and cannot optimize it away.
        auto volatile ares = std::async(std::launch::async, [&]{ while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            // Two lock/unlock pairs per iteration: small manual loop unrolling.
            for (size_t i = 0; i < ntests / 2; ++i) {
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
                { while (f.test_and_set(std::memory_order_acquire)) {} f.clear(std::memory_order_release); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            // Keep the smallest (cleanest) of the nloops measurements.
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Spinlock time: " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
    {
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        // Same dummy-thread trick for the mutex.
        auto volatile ares = std::async(std::launch::async, [&]{ std::lock_guard<std::mutex> lock(mux); });
        for (size_t j = 0; j < nloops; ++j) {
            auto tim = Time(), timr = Rdtsc();
            for (size_t i = 0; i < ntests / 2; ++i) {
                { std::lock_guard<std::mutex> lock(mux); }
                { std::lock_guard<std::mutex> lock(mux); }
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
        std::cout << "Mutex time : " << double(min_time) / ntests << " ns (" << double(min_rdtsc) / ntests
            << " cycles, " << double(min_rdtsc) / min_time << " GHz)" << std::endl;
    }
}
Output of a modern 3 GHz GodBolt Linux online server:
Spinlock time: 7.65 ns (22.95 cycles, 3.00 GHz)
Mutex time : 15.59 ns (46.78 cycles, 3.00 GHz)
Output of the old 2 GHz Windows laptop:
Spinlock time: 12.38 ns (25.94 cycles, 2.10 GHz)
Mutex time : 74.27 ns (155.58 cycles, 2.09 GHz)
(std::shared_mutex time for the PC above is 50 ns)
One can make the program more complex by running multiple threads, to measure how the timings behave when several threads race to acquire the same atomic flag or mutex.
The following program tests both single-threaded and multi-threaded timings. It also increments a counter as a kind of spinlock/mutex-protected work.
To get precise results the program sets thread affinity, assigning different threads to different cores.
The multi-threaded version uses std::barrier to start and stop the measurements at exactly the same points in time. It also uses atomic counters to synchronize the pace of the threads so that they advance at the same rate.
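The start/stop synchronization on its own looks roughly like this (a sketch with placeholder work instead of the lock, and without the pace-syncing atomic counters, which are only in the full program):

#include <barrier>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    size_t constexpr nthr = 2, ntests = 1 << 20;
    std::barrier barrier(nthr);
    std::vector<uint64_t> timings(nthr);
    std::vector<std::thread> threads;
    for (size_t ithr = 0; ithr < nthr; ++ithr)
        threads.emplace_back([&, ithr]{
            barrier.arrive_and_wait(); // all threads start measuring at the same moment
            auto start = std::chrono::high_resolution_clock::now();
            volatile uint64_t sink = 0; // placeholder for the lock/unlock work under test
            for (size_t i = 0; i < ntests; ++i)
                sink = sink + 1;
            timings[ithr] = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::high_resolution_clock::now() - start).count();
            barrier.arrive_and_wait(); // all threads stop before results are read
        });
    for (auto & t : threads)
        t.join();
    for (size_t ithr = 0; ithr < nthr; ++ithr)
        std::cout << "thread " << ithr << ": " << double(timings[ithr]) / ntests << " ns/op" << std::endl;
}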
The multi-threaded timings should be almost equal to the single-threaded ones, especially on modern CPUs. They can be MUCH bigger only if the cores chosen via affinity are heavily occupied by the operating system, or if the operating system re-schedules the threads in some strange order. It happened to me that the Windows multi-threaded timings were large while the Linux timings were perfectly fine.
See the timings after the code. Note that the GodBolt servers' timings are very small. The spinlock (both single- and multi-threaded) takes 8 ns (23 cycles), which corresponds to the timing of the atomic flag's main instruction, XCHG (17 cycles here). The mutex takes 16 ns (50 cycles), almost the same as the spinlock, because on Linux it is implemented on top of an atomic flag; the extra overhead comes from the syscall operation or the CALL instruction.
Try it online!
#include <cstdint>
#include <atomic>
#include <algorithm>
#include <chrono>
#include <mutex>
#include <future>
#include <iostream>
#include <iomanip>
#include <barrier>
#include <optional>
#include <thread>
#include <vector>
#include <immintrin.h>
#if defined(_MSC_VER) && !defined(__clang__)
#define FINL [[msvc::forceinline]]
#else
#define FINL __attribute__((always_inline))
#endif
#ifdef _WIN32
#include <windows.h>
// https://stackoverflow.com/a/56486809/941531
inline void SetAffinity(std::thread & thr, size_t i) {
    if (SetThreadAffinityMask(thr.native_handle(), DWORD_PTR(1) << i) == 0)
        std::cout << "SetThreadAffinityMask failed, GLE = " << GetLastError() << std::endl;
}
#else
#include <pthread.h>
// https://stackoverflow.com/a/57620568/941531
inline void SetAffinity(std::thread & thr, size_t i) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(i, &cpuset);
    int rc = pthread_setaffinity_np(thr.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
        std::cout << "Error calling pthread_setaffinity_np: " << rc << std::endl;
}
#endif
int main() {
    auto Rdtsc = []() -> uint64_t { return __rdtsc(); };
    auto Time = []() -> uint64_t {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    };
    std::cout << std::fixed << std::setprecision(2);
    // Single-threaded run: hammers ReLock() and keeps the smallest timing.
    auto RunST = [&Rdtsc, &Time](auto const & ReLock, auto nloops, auto ntests, auto & min_time, auto & min_rdtsc){
        { auto volatile ares = std::async(std::launch::async, [&]{ ReLock(); }); }
        for (size_t j = 0; j < nloops; ++j) {
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 4; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };
    // Multi-threaded run: the barrier makes all threads start and stop together,
    // ReSync (atomic counters) keeps the threads advancing at the same pace.
    auto RunMT = [&Rdtsc, &Time](auto nthr, auto ithr, auto const & ReLock, auto nloops, auto ntests, auto & barrier, auto & timings, auto & min_time, auto & min_rdtsc, auto & brs){
        for (size_t j = 0; j < nloops; ++j) {
            barrier.arrive_and_wait();
            size_t nibr = (ithr + 1) % nthr;
            uint32_t ibrc = 0;
            auto ReSync = [&]{
                ibrc += 16;
                brs[ithr].store(ibrc, std::memory_order_relaxed);
                while (ibrc > brs[nibr].load(std::memory_order_relaxed)) {}
            };
            ReSync();
            auto timr = Rdtsc(), tim = Time();
            for (size_t i = 0; i < ntests / 16; ++i) {
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock(); ReLock();
                ReSync();
            }
            tim = Time() - tim;
            timr = Rdtsc() - timr;
            timings[ithr] = {tim, timr};
            barrier.arrive_and_wait();
            if (ithr != 0)
                continue;
            // Thread 0 aggregates; discard the loop if per-thread timings diverge too much.
            std::sort(timings.begin(), timings.end());
            if (timings.back().first >= timings.front().first * 1.3)
                continue;
            tim = 0; timr = 0;
            for (size_t i = 0; i < timings.size(); ++i) {
                tim += timings[i].first;
                timr += timings[i].second;
            }
            tim /= timings.size();
            timr /= timings.size();
            if (tim < min_time) {
                min_time = tim;
                min_rdtsc = timr;
            }
        }
    };
    auto const hard_threads = std::thread::hardware_concurrency();
    auto const test_threads = 2; //std::max<size_t>(2, hard_threads / 2);
    std::cout << test_threads << " threads." << std::endl;
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::atomic_flag f = ATOMIC_FLAG_INIT;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&f, &vcnt]() FINL {
            while (f.test_and_set(std::memory_order_acquire)) {}
            ++vcnt;
            f.clear(std::memory_order_release);
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                // Pin each thread to its own core.
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Spinlock time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 8, ntests = 1 << 11;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry)
            RunST(ReLock, nloops, ntests, min_time, min_rdtsc);
        std::cout << "Single-Threaded Mutex time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * (nloops * ntests + 1)) << std::endl;
    }
    {
        uint64_t vcnt = 0;
        size_t constexpr ntries = 1 << 4, nloops = 1 << 6, ntests = 1 << 11;
        size_t const nthr = test_threads;
        uint64_t min_time = 1ULL << 50, min_rdtsc = 0;
        std::mutex mux;
        std::barrier barrier(nthr, []() noexcept {});
        std::vector<std::pair<uint64_t, uint64_t>> timings(nthr);
        std::vector<std::optional<std::thread>> threads(nthr);
        auto const ReLock = [&mux, &vcnt]() FINL {
            std::lock_guard<std::mutex> lock(mux);
            ++vcnt;
        };
        auto const stim = Time();
        for (size_t itry = 0; itry < ntries; ++itry) {
            std::vector<std::atomic<uint32_t>> brs(nthr);
            for (size_t ithr = 0; ithr < threads.size(); ++ithr) {
                threads[ithr] = std::thread([&, ithr]{ RunMT(nthr, ithr, ReLock, nloops, ntests, barrier, timings, min_time, min_rdtsc, brs); });
                SetAffinity(*threads[ithr], nthr * 2 <= hard_threads ? ithr * 2 : ithr);
            }
            for (auto & t: threads)
                t->join();
            threads.clear();
            threads.resize(nthr);
        }
        std::cout << "Multi -Threaded Mutex time: " << std::setprecision(2) << std::setw(6) << double(min_time) / ntests
            << " ns (" << std::setw(6) << double(min_rdtsc) / ntests << " cycles, " << double(min_rdtsc) / min_time << " GHz), test time "
            << std::setw(4) << (Time() - stim) / 1'000'000'000.0 << " sec, precision "
            << std::setprecision(8) << double(vcnt) / (ntries * nthr * nloops * ntests) << std::endl;
    }
}
Output on GodBolt Linux servers:
2 threads.
Single-Threaded Spinlock time: 8.63 ns ( 25.91 cycles, 3.00 GHz), test time 0.31 sec, precision 1.00000000
Multi -Threaded Spinlock time: 8.51 ns ( 25.58 cycles, 3.00 GHz), test time 1.04 sec, precision 1.00000000
Single-Threaded Mutex time: 15.89 ns ( 47.66 cycles, 3.00 GHz), test time 0.82 sec, precision 1.00000000
Multi -Threaded Mutex time: 17.94 ns ( 53.86 cycles, 3.00 GHz), test time 4.18 sec, precision 1.00000000
Output on the old Windows laptop:
2 threads.
Single-Threaded Spinlock time: 31.40 ns ( 67.22 cycles, 2.14 GHz), test time 0.48 sec, precision 1.00000000
Multi -Threaded Spinlock time: 32.45 ns ( 70.21 cycles, 2.16 GHz), test time 0.98 sec, precision 1.00000000
Single-Threaded Mutex time: 79.69 ns (168.36 cycles, 2.11 GHz), test time 0.94 sec, precision 1.00000000
Multi -Threaded Mutex time: 83.67 ns (269.23 cycles, 3.22 GHz), test time 0.91 sec, precision 1.00000000