
The system is Linux (Gentoo x64), the code is C++. I have a daemon application, several instances of which run on the same machine. The application itself is multithreaded. For some time, I have been observing strange delays in its performance.

After adding some debugging code, I came across a strange thing: several instances of the daemon literally block simultaneously, as if caused by some external factor. To put it simply, I have a sequence like this (sketched in code right after the list):

  1. log time (t1)
  2. lock mutex
  3. call C++ std::list::push_back()/pop_back() (i.e. a trivially cheap operation)
  4. unlock mutex
  5. log time (t2)
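
For reference, here is a minimal sketch of the instrumented pattern (the names and the 100 ms threshold are mine, not from the actual daemon). Note that comparing t2 across processes requires a wall clock such as std::chrono::system_clock, while std::chrono::steady_clock is safer for measuring the interval itself because it is immune to NTP adjustments:

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <mutex>

    std::mutex g_mutex;          // hypothetical names, for illustration only
    std::list<int> g_queue;

    void touch_queue() {
        auto t1 = std::chrono::system_clock::now();       // step 1: log time (t1)
        {
            std::lock_guard<std::mutex> lock(g_mutex);    // step 2: lock mutex
            g_queue.push_back(42);                        // step 3: trivial list ops
            g_queue.pop_back();
        }                                                 // step 4: unlock mutex
        auto t2 = std::chrono::system_clock::now();       // step 5: log time (t2)
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        if (us > 100000)                                  // flag anything over 100 ms
            std::printf("slow critical section: %lld us\n", (long long)us);
    }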

From time to time, I clearly see that the sequence above, running in several independent (!) processes, blocks at step 2 (or possibly at step 4) for a really excessive time given how cheap step 3 is (for instance, 0.5-1.0 seconds). As proof, I see that t2 in the logs of all the processes is literally the same (differing by a few microseconds). It looks like threads of the processes enter the section at noticeably different times (I can clearly see a 0.5-1 second spread in t1), block on the mutex, and unlock at the SAME TIME, having apparently spent an unreasonable amount of time in the lock according to the log (the t2 - t1 difference). Looks creepy to me.

The manifestation of the issue is relatively rare, about once every 5-10 minutes under moderate load. No NTP time shifts were logged during the test (that was actually my first idea). If it were NTP, there would be no ACTUAL delays in service, only wrong times in the log.

Where do I start? Should I start tuning the scheduler? What can theoretically block an entire multithreaded process in Linux?

  • So the mutex protects a resource shared by the threads... Do you know how many threads are present when you see the blocking behavior? This might not be weird if a ton of threads are all waiting for their turn at that moment. Also, is there any reason that accessing the resource has a large variability in time? – nckturner Feb 15 '13 at 02:17
  • Maybe you could monitor the memory pressure on the system when this happens. Just a conjecture (not too far-fetched): push_back could perform an allocation, which might trigger virtual memory activity, and slowness may ensue if the total commit charge on the system is already high. – nanda Feb 15 '13 at 02:33
  • What's calling this sequence? How often does it run? – ethrbunny Feb 15 '13 at 02:47
  • Good point about the memory, nanda. – neoxic Feb 15 '13 at 03:02
  • There are about 10 to 30 threads. There is no rush for the resource (I know this because the higher-level activity is also measured). And if there were real races for the resource, it wouldn't explain how the independent processes interfere in these terms. – neoxic Feb 15 '13 at 03:04
  • The sequence is called relatively seldom. To make things easier to picture, the list is actually a queue of sockets. It works in such a way that a thread pops a socket, performs its I/O, and pushes it back. Needless to say, mutex locking/unlocking looks ridiculously cheap compared to the I/O costs, even if the reads/writes were single bytes (which is not the case). – neoxic Feb 15 '13 at 03:15
  • Haha! I got the bug! )) Sorry guys to have wasted your time, but anyway it made me look more thoroughly into the logs. Simply put, the delays were due to the debugging itself. )) In particular, debug logging was blocked due to disk delays. That's why the measurement of t2 was delayed. The disk delays also explain why the processes interfered (see the sketch after these comments). Thanks to everyone! No mysteries this time - sounds good! Not sure how to close the question though. – neoxic Feb 15 '13 at 04:28
  • @Seny Follow up – if the time differences were due to the debug output, how did the initial problem occur? If I understood your question correctly, you added the debug statements _after_ you encountered scheduling problems? – Jonas Schäfer Feb 16 '13 at 00:08
  • Jonas, you are right! We actually did a bad trick - we upgraded the kernel and rebooted the cluster AND ran a debug version of the program on the fresh system. The real bug was apparently due to a bug in the previous kernel or a network driver, or half a year of uptime had added some bad fuel, or something else. So we had been experiencing unexplainable delays in service over a local gigabit network, although no such issue had taken place over a loopback interface within the machines. – neoxic Feb 17 '13 at 17:56
  • So after the real issue was gone, I added a minor problem myself with the debug code and happily observed similar delays (or so I thought). – neoxic Feb 17 '13 at 17:57
  • @Jonas - Consequently, due to rare disk blockages, I observed the above issue. Silly, I know.)) But the long annoying bug is gone - that's more important, right? – neoxic Feb 17 '13 at 18:22
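
For what it's worth, the general fix for this kind of measurement artifact is to capture the timestamp first and hand the formatted record to a dedicated writer thread, so that a disk stall in the logger can never sit inside the measured t2 - t1 window. A minimal sketch (the class name and the stderr target are illustrative, not from the actual daemon):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Hypothetical asynchronous logger: callers only enqueue a string; the
    // (possibly blocking) disk write happens on a dedicated thread.
    class AsyncLog {
    public:
        AsyncLog() : writer_(&AsyncLog::drain, this) {}
        ~AsyncLog() {
            { std::lock_guard<std::mutex> l(m_); done_ = true; }
            cv_.notify_one();
            writer_.join();
        }
        void log(std::string line) {        // cheap: no disk I/O on this path
            { std::lock_guard<std::mutex> l(m_); q_.push(std::move(line)); }
            cv_.notify_one();
        }
    private:
        void drain() {
            std::unique_lock<std::mutex> l(m_);
            for (;;) {
                cv_.wait(l, [&] { return done_ || !q_.empty(); });
                while (!q_.empty()) {
                    std::string line = std::move(q_.front());
                    q_.pop();
                    l.unlock();                              // write outside the lock
                    std::fprintf(stderr, "%s\n", line.c_str());
                    l.lock();
                }
                if (done_) return;
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::string> q_;
        bool done_ = false;
        std::thread writer_;    // declared last, so it starts after the other members
    };

Worker threads call log() with a pre-formatted line that already contains its timestamp; even if the writer thread then blocks on the disk, the worker's t2 has already been taken.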

1 Answer


Run your program with:

valgrind --tool=helgrind ./your_program

You will find more issues than you expect.

Valgrind (Helgrind) will give you a detailed report on the synchronization behavior of your threaded application; nowadays it's a must before deployment.
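
To see the kind of problem it reports, here is a toy program (unrelated to the question's code) with a classic data race that Helgrind flags: two threads increment a shared variable without a lock. Build with -g -pthread so the report includes line numbers.

    #include <thread>

    int counter = 0;                  // shared, deliberately unsynchronized

    void bump() {
        for (int i = 0; i < 100000; ++i)
            ++counter;                // Helgrind reports this unlocked write
    }

    int main() {
        std::thread a(bump), b(bump);
        a.join();
        b.join();
        return counter != 200000;     // result varies from run to run
    }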

  • Thanks, edsiper! Valgrind is my Holy Grail! However, it is not suitable for production debugging. – neoxic Feb 17 '13 at 18:02