The system is Linux (Gentoo x64) and the code is C++. I have a daemon application, several instances of which run on the same machine; the application itself is multithreaded. For some time I have been observing strange delays in its performance.
After adding some debugging code, I discovered a strange thing: several instances of the daemon block simultaneously, apparently due to some external cause. To put it simply, each thread executes a sequence like this (a minimal sketch in code follows the list):
1. log time (t1)
2. lock mutex
3. call C++ std::list::push_back()/pop_back() (i.e. very simple work)
4. unlock mutex
5. log time (t2)
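Here is a minimal sketch of that instrumented section, assuming a plain std::mutex and std::list; the names (g_list_mutex, g_list, log_time) are illustrative, not the daemon's actual identifiers:

```cpp
#include <chrono>
#include <cstdio>
#include <list>
#include <mutex>

std::mutex g_list_mutex;
std::list<int> g_list;

// Logs a wall-clock timestamp with microsecond resolution.
void log_time(const char* tag)
{
    using namespace std::chrono;
    auto now = duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
    std::printf("%s: %lld us\n", tag, static_cast<long long>(now));
}

void critical_section()
{
    log_time("t1");                                      // step 1
    {
        std::lock_guard<std::mutex> lock(g_list_mutex);  // step 2
        g_list.push_back(42);                            // step 3: trivial work
        g_list.pop_back();
    }                                                    // step 4: unlock
    log_time("t2");                                      // step 5
}
```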
From time to time, I clearly see the sequence above, running in several independent (!) processes, block at step 2 (or probably at step 4) for an excessive amount of time given the trivial work at step 3 (for instance, 0.5-1.0 seconds). As proof, I see that t2 in the logs of all the processes is literally the same (differing by a few microseconds). It looks as if threads of the different processes enter the section at noticeably different times (I can clearly see a 0.5-1 second spread in t1), block on the mutex, and unlock at the SAME TIME, having allegedly spent an unreasonable amount of time in the lock according to the log (the t2 - t1 difference). Looks creepy to me.
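To tell blocking at step 2 apart from blocking at step 4, a third timestamp taken right after the lock is acquired would separate lock-wait time (t_locked - t1) from hold-and-unlock time (t2 - t_locked). A hypothetical refinement, reusing the names from the sketch above:

```cpp
void critical_section_instrumented()
{
    log_time("t1");
    {
        std::lock_guard<std::mutex> lock(g_list_mutex);
        log_time("t_locked");  // any t_locked - t1 gap is pure lock wait
        g_list.push_back(42);
        g_list.pop_back();
    }
    log_time("t2");            // t2 - t_locked covers the hold + unlock
}
```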
The issue manifests relatively rarely, about once every 5-10 minutes under moderate load. No NTP time shifts are logged during the test (that was actually my first idea). Besides, if it were NTP, there would be no ACTUAL delays in service, only wrong times in the log.
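To rule out wall-clock adjustments entirely, the same interval could be measured with both CLOCK_REALTIME and CLOCK_MONOTONIC_RAW (the latter is never adjusted by NTP); if the monotonic delta also shows 0.5-1.0 seconds, the stall is real. A sketch along those lines, wrapping the instrumented sequence from above:

```cpp
#include <cstdio>
#include <ctime>

// Difference between two timespec values, in seconds.
static double elapsed(const timespec& a, const timespec& b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

void timed_section()
{
    timespec rt1, mono1, rt2, mono2;
    clock_gettime(CLOCK_REALTIME, &rt1);
    clock_gettime(CLOCK_MONOTONIC_RAW, &mono1);

    critical_section();  // the sequence from the sketch above

    clock_gettime(CLOCK_REALTIME, &rt2);
    clock_gettime(CLOCK_MONOTONIC_RAW, &mono2);

    // If both deltas agree on a 0.5-1.0 s stall, it is not a clock shift.
    std::printf("wall: %.6f s  monotonic: %.6f s\n",
                elapsed(rt1, rt2), elapsed(mono1, mono2));
}
```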
Where do I start? Should I start tuning the scheduler? What can theoretically block an entire multithreaded process on Linux?