An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.
The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:
- The program has only two running threads
- The threads concurrently call a method protected by a single threading.RLock
- Thread 1 has acquired the lock [plus some other locks] via acquire()
- Thread 2 has called acquire() and is confirmed to be waiting for the lock
- Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call
The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':
print('A')
print('B')
However, the following will just print 'A' and block the thread:
print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')
This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.
Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?
Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.
Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:
print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG
Here is the last page of the strace:
8605 mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605 mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605 clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606 set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605 futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... set_robust_list resumed> ) = 0
8606 futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 gettimeofday( <unfinished ...>
8606 <... futex resumed> ) = 0
8605 <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... futex resumed> ) = 0
8606 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606 write(1, "CHECK IN\n", 9) = 9
8606 futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL
And the output of the three lines of code is just the following, before the program locks up:
CHECK IN
So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.
By the way I am making a couple future imports in all modules to keep with some newer python conventions ...
from __future__ import print_function, unicode_literals
... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.