
I have a program that is doing something very interesting. Basically, I have three main threads, all of them busy, and the problem thread is essentially a while loop that looks like this:

while(variable which is always true) {
    get some data;
    process data;
    print message;
}
print end message;

Now the print message gets printed for about the first 400 rounds but after this the thread just stops running. I have tried making this thread a high priority. I have tried reducing the priority of other threads. The strangest thing is that this was working previously and while debugging another problem (which was done only with print statements) it started to happen intermittently. Now it has become permanent, as in it happens every time.

Things that solved it temporarily (then it seemingly stopped working again) were:

  1. Reducing the level of print statements in other threads.

  2. Reducing the priority of other threads.

  3. Running at different times.

As a disclaimer, I am using a real-time operating system called QNX, which may be causing the problem directly. I am hoping it's not this, otherwise we have a big Core i7 and no ability to use threads.

Unfortunately the code is very long and is for work so I really shouldn't post it. I am hoping for help that will point me in the correct direction to solving it myself. Does anyone know of issues that can cause these symptoms?

Fantastic Mr Fox
  • This very much sounds like a race condition. Do you use any synchronization mechanisms like mutexes or semaphores? If so you might be running into a deadlock somewhere – Chris Nov 12 '12 at 06:58
  • Right, so could this be caused by an unprotected data type? It is combined code with other workers, so some data types are unprotected, but all mutexes that are used are definitely locked and unlocked. – Fantastic Mr Fox Nov 12 '12 at 06:59
  • Have you investigated whether a deadlock occurs, i.e. one thread waiting to acquire a lock (mutex etc.) while the thread that has the lock waits for the first thread to release another lock? – jogojapan Nov 12 '12 at 07:00
  • Are you sure that it stops working not just printing messages? – Öö Tiib Nov 12 '12 at 07:03
  • It stops printing in an infinite loop which makes me think it stops working, is there another way to test? – Fantastic Mr Fox Nov 12 '12 at 07:04
  • Unfortunately debugging threading issues is very complex. I'm afraid it's not possible to help you there without at least a bit of source code – Chris Nov 12 '12 at 07:05
  • 'Now it has become permanent, as in it happens every time' - good, now you have a chance! Unfortunately, as @Chris says, debugging your problem cannot be done by blog. It's unlikely that a commercial OS has such serious bugs, so it's down to your code. Even without the confidential/NDA issue, posting bulk code is not likely to be the best way of proceeding. The usual problem areas are inter-thread comms and drivers and we could not effectively debug those without an OS environment. Cut down your app until the problem goes away COMPLETELY, then start adding stuff back. – Martin James Nov 12 '12 at 09:26
  • Oh - approach 2 - find and take on a highly-experienced QNX contractor to solve your specific problem. A short-term contract on these terms will not be cheap, but it sounds like you've spent a lot of time/money on this issue already. I would like to say I can do it for some €€€€, but I have no QNX:(( – Martin James Nov 12 '12 at 09:31
  • ..or pick out one possible area and test it to death. Start some (50?) threads that only create messages, pass them to the other threads and back again. Check the messages for integrity when they get back. Run it for a week. Similarly the print output - start some threads that only output text, run for a week. Same thing with the other drivers. When there are NO failures, you can progress to integrating the tests and adding your app-specific code. Miserable exercise, but until you can get to the root cause of your problem, you are stuffed. – Martin James Nov 12 '12 at 09:42
  • If your code can cross-compile for another Unix OS, like Linux, you can use one of the many available thread checker utilities, e.g. Sun/Oracle Thread Analyzer (part of the free Solaris Studio). Otherwise search for thread checker for QNX or just put some prints around and try to determine the point where the thread stops (you could also use a debugger :)) – Hristo Iliev Nov 12 '12 at 10:06
  • @chris Very good, it was a race condition. I was locking a mutex around a class, but the class had a certain point where it could throw an exception without unlocking. That exception was caught and the program continued, leaving a lock that could never be released. You should add that as an answer and I will mark you correct. – Fantastic Mr Fox Nov 13 '12 at 00:49
  • @Ben Can you use C++11? There is something called `std::lock_guard`, which you can use to prevent precisely this problem: http://en.cppreference.com/w/cpp/thread/lock_guard – jogojapan Nov 13 '12 at 06:20
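
To make the `std::lock_guard` suggestion concrete, here is a minimal sketch assuming a C++11 compiler is available on the target. The class `SharedData` and its methods are hypothetical stand-ins for the class described in the comment above, not the asker's actual code. `update_manual` shows the pattern that leaks the lock when an exception is thrown; `update_guarded` shows the RAII version.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for the shared class from the comments.
class SharedData {
public:
    // Problem pattern: if process() throws, unlock() is skipped and the
    // mutex stays locked, so the next caller blocks forever.
    // Shown for comparison only; it is not called below.
    void update_manual(int value) {
        mutex_.lock();
        process(value);   // may throw
        mutex_.unlock();  // never reached when process() throws
    }

    // RAII pattern: the guard's destructor unlocks on every exit path,
    // including stack unwinding caused by an exception.
    void update_guarded(int value) {
        std::lock_guard<std::mutex> guard(mutex_);
        process(value);
    }

    int total() {
        std::lock_guard<std::mutex> guard(mutex_);
        return total_;
    }

private:
    // Illustrative processing step that rejects negative input.
    void process(int value) {
        if (value < 0) throw std::invalid_argument("negative value");
        total_ += value;
    }

    std::mutex mutex_;
    int total_ = 0;
};

// Triggers the exception through the guarded path, catches it, and then
// locks the same mutex again. The second call only returns because the
// guard released the mutex during unwinding.
int total_after_failed_update() {
    SharedData data;
    try {
        data.update_guarded(-1);
    } catch (const std::invalid_argument&) {
        // Swallowed here to mirror "the exception was caught and the
        // program continued".
    }
    data.update_guarded(5);
    return data.total();
}
```

Calling `total_after_failed_update()` completes normally with the guarded version. Routing the same sequence through `update_manual` would block on the second call, which matches the "thread just stops" symptom.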

2 Answers


This very much sounds like a race condition.
Make sure that there are no circumstances in which a locked mutex is left locked (for example an early return or an exception between lock and unlock), because that leads to a deadlock as soon as another thread needs the same mutex.
Unfortunately, debugging threading issues is very complex, so you're pretty much on your own here.
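
One way to hunt for a mutex that is never released is to temporarily replace a blocking lock with a timed one, so a stuck lock produces a reportable failure instead of a silent hang. The following is a rough sketch assuming C++11 `std::timed_mutex` is usable on your toolchain (POSIX also specifies `pthread_mutex_timedlock`). The helper `with_timed_lock` and the demo function are hypothetical names invented for this illustration.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical helper: run `work` under the lock, but give up after
// `timeout` instead of blocking forever. Returns false on timeout so the
// caller can log which lock was stuck.
template <typename Work>
bool with_timed_lock(std::timed_mutex& m,
                     std::chrono::milliseconds timeout,
                     Work work) {
    // This unique_lock constructor calls m.try_lock_for(timeout).
    std::unique_lock<std::timed_mutex> lock(m, timeout);
    if (!lock.owns_lock()) {
        return false;  // lock was not released in time
    }
    work();
    return true;       // unique_lock unlocks in its destructor
}

// Demonstration: a second thread holds the mutex (simulating a lock that
// was never released) while this thread tries to take it.
bool demo_detects_stuck_lock() {
    std::timed_mutex m;
    std::atomic<bool> held(false);
    std::atomic<bool> release(false);

    std::thread holder([&] {
        m.lock();
        held = true;
        while (!release) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        m.unlock();
    });

    // Wait until the holder really owns the mutex.
    while (!held) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    bool acquired_while_stuck =
        with_timed_lock(m, std::chrono::milliseconds(50), [] {});

    release = true;
    holder.join();

    bool acquired_after_release =
        with_timed_lock(m, std::chrono::milliseconds(50), [] {});

    return !acquired_while_stuck && acquired_after_release;
}
```

In a real program the `false` branch is where you would print the thread and lock name. The timeout value is arbitrary and should be longer than any legitimate critical section.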

Chris

Threads Programming:Synchronization - http://www.cs.cf.ac.uk/Dave/C/node31.html

Read this, scroll down to Synchronization - http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/sys_arch/kernel.html

http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/sys_arch/kernel.html#SCHEDULING

If the same data is accessed by several threads and at least one of them modifies it, protect it with a lock.

pthread_mutex_lock( &m );
. . .
while (!arbitrary_condition) {
    pthread_cond_wait( &cv, &m );
}
. . .
pthread_mutex_unlock( &m );
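
For reference, the fragment above can be filled out into a small complete program. This is only a sketch: the `Mailbox` struct, the `producer` function, and the value 42 are made up for illustration, and `ready` plays the role of `arbitrary_condition`. It uses the same POSIX calls, compiled as C++.

```cpp
#include <pthread.h>

// Hypothetical shared state; all names are illustrative.
struct Mailbox {
    pthread_mutex_t m;
    pthread_cond_t cv;
    bool ready;   // the condition the waiter is waiting for
    int value;    // data protected by the mutex
};

// Producer thread: publishes a value under the lock and signals the waiter.
void* producer(void* arg) {
    Mailbox* box = static_cast<Mailbox*>(arg);
    pthread_mutex_lock(&box->m);
    box->value = 42;
    box->ready = true;
    pthread_cond_signal(&box->cv);
    pthread_mutex_unlock(&box->m);
    return nullptr;
}

// Waiter: blocks until the producer has set `ready`, then reads the value.
// Returns -1 if the producer thread could not be created.
int wait_for_value() {
    Mailbox box;
    pthread_mutex_init(&box.m, nullptr);
    pthread_cond_init(&box.cv, nullptr);
    box.ready = false;
    box.value = 0;

    int result = -1;
    pthread_t tid;
    if (pthread_create(&tid, nullptr, producer, &box) == 0) {
        pthread_mutex_lock(&box.m);
        // The loop re-checks the condition after every wakeup, because
        // pthread_cond_wait may return spuriously.
        while (!box.ready) {
            // Atomically releases the mutex while sleeping and
            // re-acquires it before returning.
            pthread_cond_wait(&box.cv, &box.m);
        }
        result = box.value;
        pthread_mutex_unlock(&box.m);
        pthread_join(tid, nullptr);
    }

    pthread_cond_destroy(&box.cv);
    pthread_mutex_destroy(&box.m);
    return result;
}
```

The condition is tested in a `while` loop rather than an `if`, and it is only read or written while the mutex is held. Both points are easy to get wrong and both can produce a thread that sleeps forever.
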
Software_Designer