How to troubleshoot/bypass locking problem which appears to be GIL related

Question

An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.

The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:

The program has only two running threads
The threads concurrently call a method protected by a single threading.RLock
Thread 1 has acquired the lock [plus some other locks] via acquire()
Thread 2 has called acquire() and is confirmed to be waiting for the lock
Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call

The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':

print('A')
print('B')

However, the following will just print 'A' and block the thread:

print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')

This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.

Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?

Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.

Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:

print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG

Here is the last page of the strace:

8605  mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605  mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605  mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605  clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606  set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605  futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... set_robust_list resumed> )   = 0
8606  futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  gettimeofday( <unfinished ...>
8606  <... futex resumed> )             = 0
8605  <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... futex resumed> )             = 0
8606  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606  fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606  write(1, "CHECK IN\n", 9)         = 9
8606  futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL

And the output of the three lines of code is just the following, before the program locks up:

CHECK IN

So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.

By the way I am making a couple future imports in all modules to keep with some newer python conventions ...

from __future__ import print_function, unicode_literals

... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.

Still not answer? I've the same in my project - two threads serving Web requests. Nothing seems to cause the deadlock, but the trace is pretty same. — PoltoS, May 07 '12 at 20:31

score 3 · Answer 1 · answered Mar 30 '11 at 18:04

I can find nothing in the Python source which would cause unicode.encode() to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?

Does the test case below show the same lock-up in your environment?

import time
import threading

def worker(tid):
    _lock.acquire()
    if not tid:
        # wait for rest of threads to enter acquire
        time.sleep(0.5)    
    print('%d: A' % tid)
    u'hello'.encode('utf-8')
    print('%d: B' % tid)
    _lock.release()

def start(tid):
    th = threading.Thread(target=worker, args=(tid,))
    th.start()
    return th

_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
    time.sleep(1)

Output:

0: A
0: B
1: A
1: B

You can also run strace on your program to work out where the process is blocked. For example with the above script:

% strace -fTr -o trace.out python lockup.py

The -o trace.out flag tells strace to write the output to a file. You can omit that and strace will print to stderr.

The contents of trace.out should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code:

# thread 0 time.sleep(0.5) completes
24778      0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778      0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778      0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778      0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778      0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778      0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778      0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778      0.000072 _exit(0)            = ?
# thread 1 _lock.acquire()
24779      0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779      0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779      0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779      0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779      0.000064 _exit(0)            = ?
24777      0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777      0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777      0.002349 exit_group(0)       = ?

Thank you @samplebias for your thoughtful response. The test case you provided ran fine in my environment with the expected output. Good suggestion to run an strace, after some tweaking I was able to get a trace and I will add a trace and comments to my post in a few mins. [edit: forgot to answer - no, there are no other locks that should cause any deadlock between the threads] — CptJeanLuc, Mar 30 '11 at 23:47
Hmm if my example runs and your code still locks up, perhaps you can post your code or describe the salient differences between them? — samplebias, Mar 31 '11 at 00:36
Can unfortunately not post the code. Should be no salient differences. The two threads call the same object method which executes "with self: ...", and the object has __enter__ and __exit__ methods to acquire and release an RLock which is an attribute on the object. I am suspecting this race condition is an unintented feature of the python interpreter. Thread 2 waiting for thread 1, thread 1 waiting for the interpreter, and the interpreter seems to have decided it does not want anything to do with thread 1. — CptJeanLuc, Mar 31 '11 at 08:14
I meant `__enter__` and `__exit__`, commenting system decided to format those as bold — CptJeanLuc, Mar 31 '11 at 08:21

How to troubleshoot/bypass locking problem which appears to be GIL related

1 Answers1