I am seeing strange behavior in a piece of software I am working on. It is a real-time machine controller, written in C++ and running on Linux, and it makes extensive use of multithreading.

When I run the program without asking it to be realtime, everything works as I expect. But when I ask it to switch to its realtime mode, there is a clearly reproducible bug that makes the application crash. I guess it must be some kind of deadlock, because a mutex runs into a timeout and ultimately triggers an assertion.

My question is how to hunt this one down. Looking at the backtrace from the resulting core dump is not very helpful, as the cause of the problem lies somewhere in the past.

The following code does the switching between 'normal' and 'realtime' behaviour:

In main.cpp (simplified, return-codes are checked via assertions):

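    if (startAsRealtime) {
        struct sched_param sp;
        memset(&sp, 0, sizeof(sched_param));
        sp.sched_priority = 99;  // highest SCHED_RR priority on Linux
        // Switch the whole process (main thread) to round-robin realtime scheduling
        sched_setscheduler(getpid(), SCHED_RR, &sp);
    }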

In every thread (simplified, return-codes are checked via assertions):

    if (startAsRealtime) {
        sched_param param;
        // Use the attributes set below instead of inheriting them from the creating thread
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_getschedparam(&attr, &param);
        param.sched_priority = priority;
        pthread_attr_setschedpolicy(&attr, SCHED_RR);
        pthread_attr_setschedparam(&attr, &param);
    }
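
For reference, a minimal sketch of how the per-thread calls above might fit into a complete thread start, with return codes propagated instead of asserted. The helper name `start_thread` and its parameters are hypothetical, and the initialization and cleanup of `attr` are assumptions, since the simplified snippet does not show them.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: starts fn(arg) in a new thread. When `realtime` is true
// the thread is created with SCHED_RR at `priority`; otherwise the default
// attributes are used. Returns 0 on success or the first pthread error code
// (for example EPERM when the process lacks realtime privileges).
int start_thread(pthread_t* tid, void* (*fn)(void*), void* arg,
                 bool realtime, int priority) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;

    if (realtime) {
        sched_param param{};
        param.sched_priority = priority;
        // Without PTHREAD_EXPLICIT_SCHED the policy and priority below are ignored
        rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        if (rc == 0) rc = pthread_attr_setschedpolicy(&attr, SCHED_RR);
        if (rc == 0) rc = pthread_attr_setschedparam(&attr, &param);
    }
    if (rc == 0) rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```

Returning the error code lets the caller distinguish a missing privilege from a programming error before anything asserts.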

Thanks in advance

user761451

2 Answers

1

If you're using glibc as your C library, you could use the answer to the question "Is it possible to list mutexs which a thread holds" to find out which thread is holding the mutex that is timing out. That should start to narrow things down: you can then inspect that thread and find out why it's not giving up the mutex.
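
As one hedged illustration of finding the owner on glibc: its `pthread_mutex_t` keeps the kernel thread id of the current holder in an internal field, `__data.__owner`. This is an implementation detail rather than a public API, and it is only an assumption that the linked answer relies on the same mechanism. The helper names `current_tid` and `mutex_owner_tid` are made up for this sketch.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Kernel thread id of the calling thread (the number gdb shows as "LWP").
inline pid_t current_tid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// glibc-only and unsupported: reads the internal owner field of a mutex.
// Returns the kernel thread id of the holder, or 0 when the mutex is unlocked.
// The read is unsynchronized, so treat the result as a diagnostic hint only.
inline pid_t mutex_owner_tid(const pthread_mutex_t& m) {
    return static_cast<pid_t>(m.__data.__owner);
}
```

The same field can be inspected post mortem in gdb with `print mutex.__data.__owner` and matched against the LWP ids listed by `info threads`.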

caf
  • I actually know which thread is holding that mutex. The problem is, it shouldn't keep it locked. I simply don't understand why running all threads at equal priority doesn't show the bug, while running under RR-Scheduler with different priorities does. – user761451 May 24 '11 at 10:48
  • @user761451: It sounds like it might be an instance of starvation or livelock rather than genuine deadlock. If the thread holding the mutex has higher priority, it might be unlocking it, but always locking it again before the lower-priority thread gets a chance to lock it. – caf May 24 '11 at 12:46
  • It indeed turned out to be a sort of livelock. Stupid me built three threads that loop over a function that contains a sleep... Running all threads at the same priority allows the starving thread to get the lock sometimes, so it seems to work correctly. Unlocking before sleeping helped a lot. – user761451 May 25 '11 at 11:46
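
A minimal sketch of the pattern described in the comments above, assuming a worker that loops over a short critical section followed by a sleep. All names (`worker_loop`, `successful_locks`) and the timings are hypothetical. With `unlock_before_sleep` set to `false`, the mutex stays locked during the sleep, so other threads only get a chance in the instant between unlock and re-lock.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical worker: short critical section, then a pause before the next cycle.
void worker_loop(std::timed_mutex& m, std::atomic<bool>& stop,
                 bool unlock_before_sleep) {
    while (!stop.load()) {
        std::unique_lock<std::timed_mutex> lock(m);
        // ... short critical section on shared state would go here ...
        if (unlock_before_sleep) {
            lock.unlock();  // release first, then pause
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
        // If the lock is still held, it is released here by the destructor
        // and re-acquired almost immediately at the top of the loop.
    }
}

// Counts how many of `attempts` timed lock attempts succeed while the worker runs.
int successful_locks(bool unlock_before_sleep, int attempts = 20) {
    std::timed_mutex m;
    std::atomic<bool> stop{false};
    std::thread worker(worker_loop, std::ref(m), std::ref(stop),
                       unlock_before_sleep);

    int ok = 0;
    for (int i = 0; i < attempts; ++i) {
        if (m.try_lock_for(std::chrono::milliseconds(20))) {
            ++ok;
            m.unlock();
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    stop.store(true);
    worker.join();
    return ok;
}
```

Comparing `successful_locks(false)` with `successful_locks(true)` gives a rough feel for the effect. Under the default scheduler the first variant may still succeed now and then, which matches the observation that the bug stayed hidden outside realtime mode.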
0

One of your realtime threads might be spinning in a loop (not yielding), thus starving other threads and resulting in a mutex timeout.

There could also be a race condition that only manifests itself when you switch to "realtime mode": the different timing of events in realtime mode might happen to trigger some kind of deadlock.

If you have places where you acquire multiple levels of locks, or lock recursively, those should be the first places you suspect.
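
As a sketch of why multi-lock sites deserve suspicion, and one common way to defuse them: when two threads take the same two mutexes in opposite order, each can end up holding one and waiting for the other. `std::lock` acquires several mutexes with a deadlock-avoidance algorithm, so the order no longer matters. The `Account` type and function names are hypothetical and not taken from the program in question.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared state guarded by its own mutex.
struct Account {
    std::mutex m;
    int balance = 0;
};

// Moves `amount` between accounts. std::lock takes both mutexes without
// deadlocking, regardless of the order in which callers pass the accounts.
void transfer(Account& from, Account& to, int amount) {
    std::unique_lock<std::mutex> lock_from(from.m, std::defer_lock);
    std::unique_lock<std::mutex> lock_to(to.m, std::defer_lock);
    std::lock(lock_from, lock_to);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions, the classic setup for a
// lock-order deadlock if each mutex were locked one after the other.
int run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;  // total must be unchanged
}
```

If the code base uses raw pthread mutexes instead, the equivalent discipline is a fixed global lock order that every thread follows.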

If you really have no clue where the problem is, try the binary search approach for bracketing the problem. Recursively cut out half of the functionality until you narrow it down to the actual problem. You might have to mock some subsystems that are temporarily cut out.

You can apply this binary search technique to your mutex acquisition timeouts to find which one is the culprit.
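
One way to make a timeout say which mutex it belongs to is to give each mutex a name and report it on failure. The sketch below assumes C++11 `std::timed_mutex`; the `NamedMutex` class and its method names are hypothetical, not part of any existing code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical wrapper: a timed mutex that carries a name for diagnostics.
class NamedMutex {
public:
    explicit NamedMutex(std::string name) : name_(std::move(name)) {}

    // Tries to lock for at most `timeout`. On failure, logs which mutex
    // timed out and returns false instead of blocking forever.
    bool lock_or_report(std::chrono::milliseconds timeout) {
        if (m_.try_lock_for(timeout)) return true;
        std::fprintf(stderr, "timeout waiting for mutex '%s'\n", name_.c_str());
        return false;
    }

    void unlock() { m_.unlock(); }

    const std::string& name() const { return name_; }

private:
    std::timed_mutex m_;
    std::string name_;
};
```

With every acquisition site reporting a name, halving the set of suspect mutexes becomes a matter of reading the log rather than cutting out code.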

Emile Cormier
  • It's not starvation: the CPU load drops to zero when the bug is triggered, which seems more like a deadlock. On the other hand, a deadlock seems unlikely, as the program includes a deadlock-detection mechanism. The threads are all RT because the program is only the controller part of a larger system; non-RT stuff runs as a separate program. I do indeed know fairly exactly where the problem is: I can comment out some stuff and it goes away. But I still do not see what really happens. I guess there is an underlying design flaw and I am eager to find it. – user761451 May 23 '11 at 15:53
  • Edited answer to address RT issues. – Emile Cormier May 23 '11 at 16:23
  • From your code snippet, it seemed like every RT thread was running at the same (presumably max) priority. – Emile Cormier May 23 '11 at 16:25
  • Isn't selecting the right thread the very reason for having priorities? The paradigm of our system is that the controller has higher priority than everything else (we are moving some mechanical stuff with it). The controller threads run at different priorities, allowing better reaction times for the more critical threads. – user761451 May 23 '11 at 16:26
  • Removed RT stuff from answer. – Emile Cormier May 23 '11 at 16:35
  • You are even more right... Switching the main thread to prio 99 just to wait for keyboard-input that never comes isn't the best idea ;) – user761451 May 23 '11 at 16:36
  • Hehe :-D I seem to remember that RT threads don't get pre-empted unless another higher priority thread wants to execute. Non-RT threads can always get pre-empted. That would explain why it works in non-RT mode. Is the keyboard input a loop that scans for keypresses? If so, it should sleep() between scans to let other threads run. – Emile Cormier May 23 '11 at 16:50
  • It has nothing to do with that main thing; the input loop uses blocking I/O. (The system is not so new, and it's rolled out in some installations and working well. It's ongoing development that led me to my current degree of confusion.) – user761451 May 23 '11 at 18:10