I am an inexperienced developer and noticed that spdlog allows instantiating its multi-threaded loggers with spdlog::async_factory as the first template parameter:
auto basic_async_mt = spdlog::basic_logger_mt<spdlog::async_factory>("basic_mt", "logs/basic_mt.log", true);
// vs
auto basic_mt = spdlog::basic_logger_mt("basic_mt", "logs/basic_mt.log", true);
Question
Given my test results below, where asynchronous logging is slower than the synchronous multi-threaded loggers and also scales much worse under growing contention, what is the purpose of the asynchronous implementation?
Simple Benchmarking
- Results: benchmark with default queue
********** Single threaded: 1000000 messages **********
basic_st Elapsed: 0.45 secs 2239158/sec
rotating_st Elapsed: 0.44 secs 2293912/sec
daily_st Elapsed: 0.45 secs 2210196/sec
********** Multi threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.41 secs 2458098/sec
rotating_mt Elapsed: 0.45 secs 2238492/sec
daily_mt Elapsed: 0.48 secs 2074991/sec
********** Multi threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 1.05 secs 950041/sec
rotating_mt Elapsed: 1.02 secs 977636/sec
daily_mt Elapsed: 1.11 secs 900860/sec
********** Async threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.79 secs 1259206/sec
rotating_mt Elapsed: 1.00 secs 1002389/sec
daily_mt Elapsed: 0.84 secs 1197042/sec
********** Async threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 2.56 secs 390791/sec
rotating_mt Elapsed: 2.72 secs 367184/sec
daily_mt Elapsed: 2.72 secs 368253/sec
Since the repository includes some simple benchmarks, I first ran bench with a few modifications: I set the thread pool queue size to 32768 with a single thread pool worker via the line spdlog::init_thread_pool(1 << 15, 1);. I also added a bench_async_logging variant of bench_threaded_logging that uses the spdlog::async_factory template parameter as shown above; a rough sketch of that modification follows below.
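For reference, here is roughly what the modification looks like. The helper names bench_mt and bench_async_logging are illustrative rather than the exact code from spdlog's bench example, only the basic file logger is shown, and the timing only measures how long the producer threads take to hand off all messages.

#include "spdlog/spdlog.h"
#include "spdlog/async.h"
#include "spdlog/sinks/basic_file_sink.h"

#include <chrono>
#include <memory>
#include <thread>
#include <vector>

// Log `howmany` messages from `thread_count` producer threads and report throughput.
void bench_mt(int howmany, std::shared_ptr<spdlog::logger> logger, int thread_count)
{
    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < thread_count; ++t)
    {
        threads.emplace_back([&, t] {
            for (int i = t; i < howmany; i += thread_count)
                logger->info("Hello logger: msg number {}", i);
        });
    }
    for (auto &t : threads)
        t.join();

    std::chrono::duration<double> delta = std::chrono::high_resolution_clock::now() - start;
    spdlog::info("{:<16} Elapsed: {:.2f} secs {:>10}/sec", logger->name(), delta.count(),
                 static_cast<int>(howmany / delta.count()));
}

// Async variant: single worker thread and a queue of 2^15 slots; the logger is
// created through spdlog::async_factory, so logger->info() only enqueues the
// message into the thread pool instead of writing to the sink directly.
void bench_async_logging(int howmany, int thread_count)
{
    spdlog::init_thread_pool(1 << 15, 1); // queue size 32768, 1 worker thread
    auto basic_async = spdlog::basic_logger_mt<spdlog::async_factory>("basic_mt", "logs/basic_mt.log", true);
    bench_mt(howmany, std::move(basic_async), thread_count);
    spdlog::drop("basic_mt"); // unregister so the next run can reuse the name
}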
The results show that the async versions perform significantly worse than the normal threaded versions. Synchronisation happens, as far as I can tell, in the thread pool's member q_type, which is an mpmc_blocking_queue.
The queue synchronises access via a single mutex, which may degrade performance in this synthetic benchmark with 10 producer threads and 1 consumer thread. Real systems may exhibit very different behaviour, but the throughput of the asynchronous version is only half that of the multi-threaded version even with only 1 producer thread. With the experimental changes below, the asynchronous version achieves nearly equal performance for 1 thread and significantly better performance for 10 threads.
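To make the contention point concrete, here is a simplified model of a mutex-plus-condition-variable bounded MPMC queue, which is the general shape of the default implementation as I understand it. This is only an illustration, not spdlog's actual mpmc_blocking_queue, but it shows why every producer and the single consumer serialise on one lock.

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Simplified bounded MPMC blocking queue: producers and the consumer all take
// the same mutex, so every enqueue contends with every other enqueue and with
// the worker thread draining the queue.
template <typename T>
class blocking_queue
{
public:
    explicit blocking_queue(std::size_t max_items) : max_items_(max_items) {}

    // Blocks while the queue is full, then pushes and wakes the consumer.
    void enqueue(T &&item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        pop_cv_.wait(lock, [this] { return q_.size() < max_items_; });
        q_.push(std::move(item));
        push_cv_.notify_one();
    }

    // Waits up to `timeout` for an item; returns false on timeout.
    bool dequeue_for(T &out, std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!push_cv_.wait_for(lock, timeout, [this] { return !q_.empty(); }))
            return false;
        out = std::move(q_.front());
        q_.pop();
        pop_cv_.notify_one();
        return true;
    }

private:
    std::size_t max_items_;
    std::mutex mutex_;
    std::condition_variable push_cv_; // signalled when an item was pushed
    std::condition_variable pop_cv_;  // signalled when an item was popped
    std::queue<T> q_;
};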
Experimental Improvement using a lock-free bounded queue
- Source: https://github.com/Araeos/spdlog/tree/async-lock-free (see include/mpmp_queue.hpp)
- Results: benchmark with lock-free queue
********** Single threaded: 1000000 messages **********
basic_st Elapsed: 0.44 secs 2248376/sec
rotating_st Elapsed: 0.45 secs 2217191/sec
daily_st Elapsed: 0.48 secs 2092311/sec
********** Multi threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.42 secs 2395124/sec
rotating_mt Elapsed: 0.45 secs 2242889/sec
daily_mt Elapsed: 0.51 secs 1960067/sec
********** Multi threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 0.98 secs 1019659/sec
rotating_mt Elapsed: 1.28 secs 782557/sec
daily_mt Elapsed: 1.14 secs 879411/sec
********** Async threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.50 secs 2004096/sec
rotating_mt Elapsed: 0.56 secs 1781594/sec
daily_mt Elapsed: 0.49 secs 2053135/sec
********** Async threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 0.62 secs 1604676/sec
rotating_mt Elapsed: 0.71 secs 1415312/sec
daily_mt Elapsed: 0.73 secs 1378370/sec
To validate this assumption I used the lock-free bounded queue implementation provided by Dmitry Vyukov on his blog, which I slightly modified. The queue implements neither a blocking enqueue operation nor a timed dequeue_for operation.
In my MPMCQueueAdapter for this queue I used simple spinning with a backoff mechanism to implement the required waiting; the queue itself may be quite fast, but C++ atomics do not provide a standard wait mechanism before std::atomic::wait in C++20. The original non-blocking enqueue_nowait also behaved differently in that it replaced the oldest entry when the queue was full, whereas my adapter simply tries to enqueue the message and does nothing if the queue is full. Due to these caveats, the results serve only as a performance reference to gauge how much thread contention is a factor here. A sketch of the adapter idea follows below.
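For illustration, this is roughly the shape of the adapter. The try_push/try_pop names and the backoff details are simplified stand-ins rather than the exact code in my branch, and it assumes the underlying queue leaves the item untouched when a push fails.

#include <chrono>
#include <cstddef>
#include <thread>

// Sketch of an adapter that gives a non-blocking bounded MPMC queue the
// blocking enqueue / timed dequeue_for interface the thread pool expects,
// using spin-waiting with a crude backoff instead of a mutex.
template <typename T, typename LockFreeQueue>
class MPMCQueueAdapter
{
public:
    explicit MPMCQueueAdapter(std::size_t capacity) : q_(capacity) {}

    // Blocking enqueue: spin until the queue accepts the item.
    void enqueue(T &&item)
    {
        backoff wait;
        while (!q_.try_push(item)) // assumed to leave `item` intact on failure
            wait.pause();
    }

    // Non-blocking enqueue: drop the message if the queue is full
    // (unlike spdlog's enqueue_nowait, which replaces the oldest entry).
    void enqueue_nowait(T &&item)
    {
        q_.try_push(item);
    }

    // Timed dequeue: spin until an item arrives or the timeout expires.
    bool dequeue_for(T &out, std::chrono::milliseconds timeout)
    {
        auto deadline = std::chrono::steady_clock::now() + timeout;
        backoff wait;
        while (!q_.try_pop(out))
        {
            if (std::chrono::steady_clock::now() >= deadline)
                return false;
            wait.pause();
        }
        return true;
    }

private:
    // Crude backoff: yield for a while, then sleep to avoid burning a core.
    struct backoff
    {
        int spins = 0;
        void pause()
        {
            if (++spins < 64)
                std::this_thread::yield();
            else
                std::this_thread::sleep_for(std::chrono::microseconds(50));
        }
    };

    LockFreeQueue q_;
};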
Using the lock-free queue under the hood improves performance drastically, though the benchmark is synthetic and such high contention may not occur in actual systems. The crux of the default implementation may simply be that the consumer (the thread that forwards messages to the actual sink) synchronises on the same mutex as the producers, or my tests may be deeply flawed in some way. What are your thoughts?
Test System
- OS: Windows 10 x64 Version 1903
- Compiler: Visual Studio 2019 16.4
- RAM: 16GB DDR4 3200
- CPU: Ryzen 2600 6C/12T
- SSD: Samsung 850 EVO 250GB
EDIT: I used a queue size of 32768 for the benchmark results above instead of the originally stated 16384.