
I’m using the spdlog library in my high-load service. My service also receives requests from another service over gRPC.

At the moment I can’t use streaming gRPC, and every call creates its own connection. As far as I know, I can’t restrict the number of threads gRPC creates.

Now my service has started to run into performance problems because each thread writes to the same log file, which is protected by a mutex inside spdlog.

What are the solutions to this problem besides horizontal scaling (increasing the number of processes)?

Leonid
  • Fix logging. For example, use a separate single thread that logs to the file, while the other threads send logs to a thread-safe (lock-free) queue. You will need a careful design though, so you don't run out of memory (the queue might grow faster than the logger can write). – freakish Jun 03 '23 at 17:48
  • @freakish Is there a chance that I will lose some of the logs due to some system error? – Leonid Jun 03 '23 at 17:52
  • Sure. Your logs might be delayed under heavy load, so if your service dies before the dedicated thread processes all the logs, you will lose some of them forever. It's a tradeoff, so the implementation needs to be bulletproof. Or maybe you shouldn't worry; I mean, how probable is such a situation? – freakish Jun 03 '23 at 17:53
  • What freakish says. The downside is that the logging request calls return before the log is actually written, which may be an issue with some bugs that crash the entire process. The upside - almost everything else as disk delays/latencies are decoupled from the main app and log file management becomes trivial. – Martin James Jun 03 '23 at 17:55
  • I usually use a pool queue - another producer-consumer queue pre-loaded with, say, 256 log entry instances. When the main app wants to log something, it pops one off the pool, loads it up with text and pushes it to the logger thread. The logger gets the instance, writes the content and then pushes the 'used' instance back to the pool queue. If all 256 instances are stuffed onto the logger queue, the main app is then forced to wait on the pool queue before it can send another log. That slows it up during the backlog, but nothing crashes or runs away:) – Martin James Jun 03 '23 at 18:07
  • Are you writing synchronously with `spdlog`? – RandomBits Jun 03 '23 at 20:56
  • Is it strictly required that everything go into a single shared log, or could e.g. each thread write to its own separate log file? That would avoid the bottleneck, and you could always write a separate post-processing utility to read the various per-thread log files and write out a corresponding unified log file later on, if/when it turned out to be necessary to do so. – Jeremy Friesner Jun 03 '23 at 21:19
  • Take a look at [spdlog's Asynchronous logging](https://github.com/gabime/spdlog/wiki/6.-Asynchronous-logging). It delegates `log` and `flush` operations to a dedicated worker thread pool to process them asynchronously. – Sprite Jun 20 '23 at 05:31
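
For reference, here is a minimal sketch of that asynchronous setup from the spdlog wiki (the logger name, file path, and queue size below are placeholders, not values from the question):

```cpp
#include <spdlog/async.h>
#include <spdlog/sinks/basic_file_sink.h>

int main() {
    // One background worker thread drains a bounded queue of 8192 pending
    // messages; the request-handling threads only enqueue and return.
    spdlog::init_thread_pool(8192, 1);
    auto logger = spdlog::basic_logger_mt<spdlog::async_factory>(
        "async_file_logger", "logs/service.log");
    spdlog::set_default_logger(logger);

    spdlog::info("async logging enabled");  // enqueued here, written by the worker
    spdlog::shutdown();                     // flush and join the worker before exit
}
```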

2 Answers


If I understand your requirements correctly, there is an application with multiple threads logging synchronously to a single log file using spdlog.

I answered a similar question about logging, suggesting the use of a memory-mapped file, which turned out to be about 100x faster than spdlog for the prototype code posted in that answer.
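
The linked answer has the actual benchmark code; the snippet below is only a rough sketch of the memory-mapped idea under POSIX (fixed file size, no error handling, and the class name is made up):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <atomic>
#include <cstring>

// Map a pre-sized file into memory; each log call is then just a memcpy plus
// an atomic bump of the write offset, with no syscall on the hot path.
class MmapLog {
public:
    MmapLog(const char* path, size_t size) : size_(size) {
        int fd = ::open(path, O_RDWR | O_CREAT, 0644);
        ::ftruncate(fd, static_cast<off_t>(size_));
        base_ = static_cast<char*>(
            ::mmap(nullptr, size_, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MmapLog() { ::munmap(base_, size_); }

    void write(const char* msg, size_t len) {
        size_t off = offset_.fetch_add(len, std::memory_order_relaxed);
        if (off + len <= size_)            // silently drop once the file is full
            std::memcpy(base_ + off, msg, len);
    }

private:
    char* base_ = nullptr;
    size_t size_;
    std::atomic<size_t> offset_{0};
};
```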

RandomBits

The mutex itself is not the problem. The problem is that the library probably does I/O (file writes) while holding the mutex (although honestly I do not know the details). And it can get even worse if network drives (or even network services) are involved.

So one of the possible solutions is to delegate this heavy I/O to a separate, dedicated thread. Every other thread then communicates with this dedicated thread by passing messages (log entries) to a thread-safe (preferably lock-free) queue. With this approach it is extremely likely that logging won't be a bottleneck anymore.

The drawback is that log processing becomes detached from the callers. This has two negative consequences:

  1. Logging may lag behind the calls. You keep processing data, but the logs fall further and further behind. So not only do logs not arrive on time, but in case of a system failure you will lose all pending logs. In practice, however, this is very rarely a real issue.
  2. Under heavy load the queue may grow faster than the dedicated thread can drain it, and you may run out of memory. This can be solved by putting a semaphore (or a capacity limit) on the queue so that it cannot grow too large. It is a fallback to blocking calls, but it has to kick in at some threshold, otherwise you risk a crash. A sketch of such a bounded design is shown below.
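
A minimal sketch of that design, assuming a plain mutex/condition-variable queue rather than a truly lock-free one, with a hard capacity so producers block instead of exhausting memory (all names and the default capacity are made up for illustration):

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path, size_t capacity = 10000)
        : out_(path, std::ios::app), capacity_(capacity),
          worker_([this] { run(); }) {}

    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // Called by the request-handling threads: blocks only if the queue is full.
    void log(std::string line) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push(std::move(line));
        cv_.notify_one();
    }

private:
    void run() {  // dedicated thread: the only one that touches the file
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return !q_.empty() || done_; });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop();
                not_full_.notify_one();
                lk.unlock();              // do the slow I/O outside the lock
                out_ << line << '\n';
                lk.lock();
            }
            if (done_) return;
        }
    }

    std::ofstream out_;
    size_t capacity_;
    std::mutex m_;
    std::condition_variable cv_, not_full_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```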

With such a design, logging should not be a bottleneck anymore.

freakish