4

I have a multithreaded program running on Linux. Sometimes, when I run gstack against it, one thread has been waiting for a lock for a long time (say, 2-3 minutes):

Thread 2 (Thread 0x5e502b90 (LWP 19853)):
#0  0x40000410 in __kernel_vsyscall ()
#1  0x400157b9 in __lll_lock_wait () from /lib/i686/nosegneg/libpthread.so.0
#2  0x40010e1d in _L_lock_981 () from /lib/i686/nosegneg/libpthread.so.0
#3  0x40010d3b in pthread_mutex_lock () from /lib/i686/nosegneg/libpthread.so.0
...

I checked the rest of the threads, and none of them was holding this lock. However, after a while this thread (LWP 19853) acquired the lock successfully.

Some thread must have already acquired this lock, but I failed to find it. Is there anything I am missing?

EDIT: The definition of the pthread_mutex_t:

typedef union
{
  struct __pthread_mutex_s
  {
    int __lock;
    unsigned int __count;
    int __owner;
    /* KIND must stay at this position in the structure to maintain
       binary compatibility.  */
    int __kind;
    unsigned int __nusers;
    __extension__ union
    {
      int __spins;
      __pthread_slist_t __list;
    };
  } __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;

It has a member `__owner`, which is the ID of the thread that currently holds the mutex.
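As a rough illustration, the owner field can be read directly from a live mutex. This is only a sketch: it relies on glibc's private layout (`__data.__owner`), which is not part of POSIX, and it assumes a default mutex without lock elision. The helper names `current_tid` and `mutex_owner_tid` are hypothetical.

```cpp
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Kernel thread ID of the caller (the LWP number that gdb/gstack print).
static pid_t current_tid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// glibc-specific: returns the LWP recorded as owner, or 0 if unlocked.
// Reads a private field, so it is neither portable nor synchronized;
// treat the result as a debugging hint only.
static pid_t mutex_owner_tid(const pthread_mutex_t* m) {
    return m->__data.__owner;
}
```

The value can be compared with the LWP numbers in the gstack output. With a core dump and debug information, printing the same field from the debugger should give the same number.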

Derui Si
  • Isn't it written on top of the output? `Thread 2` – RedX Jul 09 '12 at 08:36
  • Thread 2 is waiting for the lock. I want to find the thread that is holding the lock right now, but I failed. – Derui Si Jul 09 '12 at 08:49
  • possible duplicate of [Is it possible to determine the thread holding a mutex?](http://stackoverflow.com/questions/3483094/is-it-possible-to-determine-the-thread-holding-a-mutex) - the accepted answer there should help you. – caf Jul 12 '12 at 07:37

4 Answers

2

2-3 minutes sounds like a lot, but if your system is under heavy load, there is no guarantee that your thread wakes up immediately after another one has unlocked the mutex. So there might simply be no thread (anymore) holding the lock at the moment you look at it.

Linux mutexes work in two stages. Roughly:

  • In the first stage, an atomic CAS operation on an int value checks whether the mutex can be locked immediately.
  • If this is not possible, the thread makes a futex_wait system call, passing the address of the same int to the kernel.

An unlock operation then consists of changing the value back to the initial value (usually 0) and doing a futex_wake system call. The kernel then checks whether someone registered a futex_wait call on the same address and puts those threads back in the scheduling queue. Which thread then really gets woken up, and when, depends on different things, in particular the scheduling policy that is enabled. There is no guarantee that threads obtain the lock in the order in which they requested it.
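The two stages described above can be sketched as follows. This is a simplified illustration, not glibc's actual implementation, which is considerably more elaborate. The class name `FutexMutex` and its helper `futex_call` are hypothetical, and the code assumes that `std::atomic<int>` has the same representation as a plain `int`, which holds on mainstream Linux toolchains.

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// State: 0 = unlocked, 1 = locked, 2 = locked and possibly contended.
class FutexMutex {
public:
    // Stage 1: a single CAS in user space, no system call.
    bool try_lock() {
        int expected = 0;
        return state_.compare_exchange_strong(expected, 1);
    }
    void lock() {
        if (try_lock()) return;
        // Stage 2: mark the lock as contended and sleep in the kernel
        // for as long as the int still holds the value 2.
        while (state_.exchange(2) != 0) futex_call(FUTEX_WAIT, 2);
    }
    void unlock() {
        // Reset to 0; ask the kernel to wake one waiter only if
        // contention was recorded.
        if (state_.exchange(0) == 2) futex_call(FUTEX_WAKE, 1);
    }
private:
    long futex_call(int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(&state_),
                       op, val, nullptr, nullptr, 0);
    }
    std::atomic<int> state_{0};
};
```

Note that nothing in this scheme hands the lock to a specific waiter: a woken thread still has to win the exchange against any other thread that tries at the same moment.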

Jens Gustedt
  • This must be a really **heavily** loaded system for a thread not to wake up until minutes after it could have. Anyway, I do agree with your answer! – alk Jul 09 '12 at 12:52
  • The thing here is that I reviewed the call stacks of each thread, and none of the threads had already obtained this lock. So what confuses me is why this thread (thread 2) doesn't lock the mutex immediately, since no one has taken it. – Derui Si Jul 09 '12 at 14:13
  • @ageek2remember, are you sure that no thread uses `trylock` and that you initialize your mutex correctly? If you do, a way to know who is taking the lock would be to place a lock immediately after the initialization, then let the execution proceed as normal until one of the threads blocks. That one would then be the culprit. – Jens Gustedt Jul 09 '12 at 14:32
  • Yes, I'm sure, at least after reading the gstack output. This is in production, and I'm not able to reproduce it in my lab. So changing the code to debug is not an option right now. – Derui Si Jul 09 '12 at 14:49
2

Mutexes by default don't track the thread that locked them. (Or at least I don't know of such a thing.)

There are two ways to debug this kind of problem. One way is to log every lock and unlock. On every thread creation, you log the ID of the thread that was created. Right after locking any lock, you log the thread ID and the name of the lock that was locked (you can use file/line for this, or assign a name to each lock). And you log again right before unlocking any lock.

This is a fine way to do it if your program doesn't have tens of threads or more. After that the logs start to become unmanageable.
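A minimal sketch of such logging for pthread mutexes might look like this. The function `log_lock_event` and the macros `LOGGED_LOCK` and `LOGGED_UNLOCK` are hypothetical names, and the sketch assumes Linux (it uses `SYS_gettid` so the logged ID matches the LWP numbers in gstack output).

```cpp
#include <cstdio>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Writes one log line: kernel thread ID, event, lock name, call site.
static inline void log_lock_event(const char* what, const char* name,
                                  const char* file, int line) {
    std::fprintf(stderr, "[tid %ld] %s %s at %s:%d\n",
                 static_cast<long>(syscall(SYS_gettid)),
                 what, name, file, line);
}

// Log right after the lock has been taken ...
#define LOGGED_LOCK(m)                                           \
    do {                                                         \
        pthread_mutex_lock(&(m));                                \
        log_lock_event("locked", #m, __FILE__, __LINE__);        \
    } while (0)

// ... and right before it is released.
#define LOGGED_UNLOCK(m)                                         \
    do {                                                         \
        log_lock_event("unlocking", #m, __FILE__, __LINE__);     \
        pthread_mutex_unlock(&(m));                              \
    } while (0)
```

The last "locked" line without a matching "unlocking" line for a given mutex then points at the thread that still holds it.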

The other way is to wrap your lock in a class that stores the thread id in a lock object right after each lock. You might even create a global lock registry that tracks this, that you can print out when you need to.

Something like:

class MyMutex
{
public:
    void lock() { mMutex.lock(); mLockingThread = getThreadId(); }
    void unlock() { mLockingThread = 0; mMutex.unlock(); }
    SystemMutex mMutex;
    ThreadId    mLockingThread;
};
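For reference, here is one way the wrapper above could be made compilable, assuming C++11 and substituting `std::mutex` and `std::thread::id` for the placeholder types `SystemMutex` and `ThreadId`. The class name `OwnerTrackingMutex` is hypothetical. The owner is kept in an atomic so that another thread can read it without a data race.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Debug-only wrapper: remembers which thread currently holds the mutex.
class OwnerTrackingMutex {
public:
    void lock() {
        mutex_.lock();
        // Recorded right after the lock has been acquired.
        owner_.store(std::this_thread::get_id());
    }
    void unlock() {
        // A default-constructed id means "no owner".
        owner_.store(std::thread::id());
        mutex_.unlock();
    }
    // Safe to call from any thread, e.g. from a watchdog that reports
    // locks held for too long.
    std::thread::id owner() const { return owner_.load(); }
private:
    std::mutex mutex_;
    std::atomic<std::thread::id> owner_{std::thread::id()};
};
```

Because it provides `lock()` and `unlock()`, the wrapper can be used with `std::lock_guard<OwnerTrackingMutex>`.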

The key here is: don't implement either of these methods in your release version. Both a global locking log and a global registry of lock states create a single global resource that will itself be subject to lock contention.

Rafael Baptista
  • If we look at the definition of pthread_mutex_t, we can see that it has a member variable `__owner` that indicates the holding thread. – Derui Si Jul 09 '12 at 14:32
  • If we can get a core dump of the running application, then we are able to find the thread that is holding the mutex right now. Usually I'm able to find the thread holding the mutex by reading the call stacks in the gstack output, but in this case I failed to find it with only the gstack output available. – Derui Si Jul 09 '12 at 14:39
  • I don't know; maybe there is a way to use pthread's own thread tracking. In that case I don't know why the tracking appears to fail in some cases. I do know it is easy enough to track the threads yourself with some thread and mutex classes. – Rafael Baptista Jul 09 '12 at 14:46
0

The POSIX API doesn't contain a function that does it.

It's also possible that on some platforms, the implementation doesn't allow this.
For example, a lock can use an atomic variable, set to 1 when locked. The thread obtaining it doesn't have to write its ID anywhere, so no function can find it.
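As an illustration of such an implementation, here is a minimal sketch of a lock whose entire state is a single atomic flag. The class name `FlagLock` is hypothetical, and the busy-wait loop is kept only for brevity.

```cpp
#include <atomic>

// A lock whose whole state is one flag: set when locked, clear when free.
// Nothing identifies the holder, so there is no owner to look up.
class FlagLock {
public:
    // Returns true if the flag was clear and is now set by the caller.
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void lock() {
        while (!try_lock()) { /* spin until the flag is cleared */ }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

Inspecting such a lock in a debugger shows only whether it is taken, not by whom.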

ugoren
0

For such debugging issues you might add special logging calls to your program, stating when each thread acquired the lock and when it released it.

Such log entries will then help you find which thread acquired the lock last.

However, doing so might massively change the run-time behavior of the program, so that the issue to be debugged no longer appears. It would then turn out to be a classic heisenbug, as often seen in multithreaded applications.

alk