I am an inexperienced developer and noticed that spdlog allows instantiating its multi-threaded loggers with spdlog::async_factory as the first template parameter:
auto basic_async_mt = spdlog::basic_logger_mt<spdlog::async_factory>("basic_mt", "logs/basic_mt.log", true);
// vs
auto basic_mt = spdlog::basic_logger_mt("basic_mt", "logs/basic_mt.log", true);
Question
Given my test results below, where asynchronous logging is slower than the synchronous multi-threaded loggers and also scales much worse under growing contention, what is the purpose of the asynchronous implementation?
Simple Benchmarking
- Results: benchmark with default queue
********** Single threaded: 1000000 messages **********
basic_st Elapsed: 0.45 secs 2239158/sec
rotating_st Elapsed: 0.44 secs 2293912/sec
daily_st Elapsed: 0.45 secs 2210196/sec
********** Multi threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.41 secs 2458098/sec
rotating_mt Elapsed: 0.45 secs 2238492/sec
daily_mt Elapsed: 0.48 secs 2074991/sec
********** Multi threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 1.05 secs 950041/sec
rotating_mt Elapsed: 1.02 secs 977636/sec
daily_mt Elapsed: 1.11 secs 900860/sec
********** Async threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.79 secs 1259206/sec
rotating_mt Elapsed: 1.00 secs 1002389/sec
daily_mt Elapsed: 0.84 secs 1197042/sec
********** Async threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 2.56 secs 390791/sec
rotating_mt Elapsed: 2.72 secs 367184/sec
daily_mt Elapsed: 2.72 secs 368253/sec
Since the repository includes some simple benchmarks, I first ran bench with a few modifications: I set the thread pool queue size to 32768 with a single thread pool worker via the line spdlog::init_thread_pool(1 << 15, 1);. I also added a bench_async_logging variant of bench_threaded_logging that uses the spdlog::async_factory template parameter as shown above; a rough sketch of that modification follows below.
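For reference, here is roughly what the modification looks like. The helper names bench_mt and bench_async_logging are illustrative rather than the exact code from spdlog's bench example, only the basic file logger is shown, and the timing only measures how long the producer threads take to hand off all messages.

#include "spdlog/spdlog.h"
#include "spdlog/async.h"
#include "spdlog/sinks/basic_file_sink.h"

#include <chrono>
#include <memory>
#include <thread>
#include <vector>

// Log `howmany` messages from `thread_count` producer threads and report throughput.
void bench_mt(int howmany, std::shared_ptr<spdlog::logger> logger, int thread_count)
{
    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < thread_count; ++t)
    {
        threads.emplace_back([&, t] {
            for (int i = t; i < howmany; i += thread_count)
                logger->info("Hello logger: msg number {}", i);
        });
    }
    for (auto &t : threads)
        t.join();

    std::chrono::duration<double> delta = std::chrono::high_resolution_clock::now() - start;
    spdlog::info("{:<16} Elapsed: {:.2f} secs {:>10}/sec", logger->name(), delta.count(),
                 static_cast<int>(howmany / delta.count()));
}

// Async variant: single worker thread and a queue of 2^15 slots; the logger is
// created through spdlog::async_factory, so logger->info() only enqueues the
// message into the thread pool instead of writing to the sink directly.
void bench_async_logging(int howmany, int thread_count)
{
    spdlog::init_thread_pool(1 << 15, 1); // queue size 32768, 1 worker thread
    auto basic_async = spdlog::basic_logger_mt<spdlog::async_factory>("basic_mt", "logs/basic_mt.log", true);
    bench_mt(howmany, std::move(basic_async), thread_count);
    spdlog::drop("basic_mt"); // unregister so the next run can reuse the name
}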
The results show that the async versions perform significantly worse than the normal threaded versions. Synchronisation happens, as far as I can tell, in the thread pool's member q_type, which is an mpmc_blocking_queue.
The queue synchronises access via a single mutex, which may degrade performance in this synthetic benchmark with 10 producer threads and 1 consumer thread. Real systems may exhibit very different behaviour, but the throughput of the asynchronous version is only half that of the multi-threaded version even with only 1 producer thread. With the experimental changes below, the asynchronous version achieves nearly equal performance for 1 thread and significantly better performance for 10 threads.
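To make the contention point concrete, here is a simplified model of a mutex-plus-condition-variable bounded MPMC queue, which is the general shape of the default implementation as I understand it. This is only an illustration, not spdlog's actual mpmc_blocking_queue, but it shows why every producer and the single consumer serialise on one lock.

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Simplified bounded MPMC blocking queue: producers and the consumer all take
// the same mutex, so every enqueue contends with every other enqueue and with
// the worker thread draining the queue.
template <typename T>
class blocking_queue
{
public:
    explicit blocking_queue(std::size_t max_items) : max_items_(max_items) {}

    // Blocks while the queue is full, then pushes and wakes the consumer.
    void enqueue(T &&item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        pop_cv_.wait(lock, [this] { return q_.size() < max_items_; });
        q_.push(std::move(item));
        push_cv_.notify_one();
    }

    // Waits up to `timeout` for an item; returns false on timeout.
    bool dequeue_for(T &out, std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!push_cv_.wait_for(lock, timeout, [this] { return !q_.empty(); }))
            return false;
        out = std::move(q_.front());
        q_.pop();
        pop_cv_.notify_one();
        return true;
    }

private:
    std::size_t max_items_;
    std::mutex mutex_;
    std::condition_variable push_cv_; // signalled when an item was pushed
    std::condition_variable pop_cv_;  // signalled when an item was popped
    std::queue<T> q_;
};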
Experimental Improvement using a lock-free bounded queue
- Source: https://github.com/Araeos/spdlog/tree/async-lock-free (see include/mpmp_queue.hpp)
- Results: benchmark with lock-free queue
********** Single threaded: 1000000 messages **********
basic_st Elapsed: 0.44 secs 2248376/sec
rotating_st Elapsed: 0.45 secs 2217191/sec
daily_st Elapsed: 0.48 secs 2092311/sec
********** Multi threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.42 secs 2395124/sec
rotating_mt Elapsed: 0.45 secs 2242889/sec
daily_mt Elapsed: 0.51 secs 1960067/sec
********** Multi threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 0.98 secs 1019659/sec
rotating_mt Elapsed: 1.28 secs 782557/sec
daily_mt Elapsed: 1.14 secs 879411/sec
********** Async threaded: 1 threads, 1000000 messages **********
basic_mt Elapsed: 0.50 secs 2004096/sec
rotating_mt Elapsed: 0.56 secs 1781594/sec
daily_mt Elapsed: 0.49 secs 2053135/sec
********** Async threaded: 10 threads, 1000000 messages **********
basic_mt Elapsed: 0.62 secs 1604676/sec
rotating_mt Elapsed: 0.71 secs 1415312/sec
daily_mt Elapsed: 0.73 secs 1378370/sec
To validate this assumption I used the lock-free bounded queue implementation provided by Dmitry Vyukov on his blog, which I slightly modified. The queue implements neither a blocking enqueue operation nor a timed dequeue_for operation.
In my MPMCQueueAdapter for this queue I used simple spinning with a backoff mechanism to implement the required waiting; the queue itself may be quite fast, but C++ atomics do not provide a standard wait mechanism before std::atomic::wait in C++20. The original non-blocking enqueue_nowait also behaved differently in that it replaced the oldest entry when the queue was full, whereas my adapter simply tries to enqueue the message and does nothing if the queue is full. Due to these caveats, the results serve only as a performance reference to gauge how much thread contention is a factor here. A sketch of the adapter idea follows below.
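For illustration, this is roughly the shape of the adapter. The try_push/try_pop names and the backoff details are simplified stand-ins rather than the exact code in my branch, and it assumes the underlying queue leaves the item untouched when a push fails.

#include <chrono>
#include <cstddef>
#include <thread>

// Sketch of an adapter that gives a non-blocking bounded MPMC queue the
// blocking enqueue / timed dequeue_for interface the thread pool expects,
// using spin-waiting with a crude backoff instead of a mutex.
template <typename T, typename LockFreeQueue>
class MPMCQueueAdapter
{
public:
    explicit MPMCQueueAdapter(std::size_t capacity) : q_(capacity) {}

    // Blocking enqueue: spin until the queue accepts the item.
    void enqueue(T &&item)
    {
        backoff wait;
        while (!q_.try_push(item)) // assumed to leave `item` intact on failure
            wait.pause();
    }

    // Non-blocking enqueue: drop the message if the queue is full
    // (unlike spdlog's enqueue_nowait, which replaces the oldest entry).
    void enqueue_nowait(T &&item)
    {
        q_.try_push(item);
    }

    // Timed dequeue: spin until an item arrives or the timeout expires.
    bool dequeue_for(T &out, std::chrono::milliseconds timeout)
    {
        auto deadline = std::chrono::steady_clock::now() + timeout;
        backoff wait;
        while (!q_.try_pop(out))
        {
            if (std::chrono::steady_clock::now() >= deadline)
                return false;
            wait.pause();
        }
        return true;
    }

private:
    // Crude backoff: yield for a while, then sleep to avoid burning a core.
    struct backoff
    {
        int spins = 0;
        void pause()
        {
            if (++spins < 64)
                std::this_thread::yield();
            else
                std::this_thread::sleep_for(std::chrono::microseconds(50));
        }
    };

    LockFreeQueue q_;
};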
Using the lock-free queue under the hood improves performance drastically, though the benchmark is synthetic and such high contention may not occur in actual systems. The crux of the default implementation may simply be that the consumer (the thread that forwards messages to the actual sink) synchronises on the same mutex as the producers, or my tests may be deeply flawed in some way. What are your thoughts?
Test System
- OS: Windows 10 x64 Version 1903
- Compiler: Visual Studio 2019 16.4
- RAM: 16GB DDR4 3200
- CPU: Ryzen 2600 6C/12T
- SSD: Samsung 850 EVO 250GB
EDIT: I used a queue size of 32768 for the benchmark results above instead of the originally stated 16384.