Q: Is the GIL ever unlocked for non-IO-bound work? ... Yes, it can be ( in several ways ).

This was the OP's question, wasn't it?
So, let's solve it. Python is an interpreted language. The Python interpreter, by design, uses the GIL, a.k.a. the Global Interpreter Lock ( i.e., it is a Python-internal-only "LOCK-device" and has nothing to do with other O/S-locks, IO-locks et al ).

The GIL-lock is a soft-signalling tool used internally inside the Python interpreter to coordinate its own work and, principally, to avoid any concurrency-originated collisions ( two attempts to write a value into the same variable, or an attempt to read a potentially "old" value from a variable that is "currently" being written a "new" value into ), thus artificially introducing a deterministic, purely sequential, principally never-colliding ordering of such internal Python operations.
This means all Python threads obey the GIL-based signalling, so the effective concurrency of any pool of Python-GIL-coordinated threads is 1. IO-related waits are the one exception: there an external device introduces a "natural" wait-state, and such a "naturally" waiting thread releases the GIL-lock, signalling Python that it may "lend" that thread's wait-time to some other Python thread to do something useful. For computing-intensive thread-processing the same logic makes no sense, as none of the Python threads inside such a computing pool has any "externally" introduced "natural" wait-state; they need the very opposite, as much scheduled processor-time as possible. Yet the GIL plays a round-robin, pure-[SERIAL] sequence of the CPU working with the Python threads one after another:

tA-tB-tC-tD-tE-...-tA-tB-tC-tD-tE-...

thus efficiently avoiding any and all of the potential [CONCURRENT] process-scheduling benefits.
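That pure-[SERIAL] round-robin is easy to observe on your own machine; here is a minimal sketch ( the `burn()` helper and the loop size are illustrative assumptions, not code from the question ), showing that 2 threads of pure CPU-work do not finish in half the sequential time on a GIL-coordinated CPython:

```python
import threading, time

def burn( n ):                          # pure CPU-bound busy-work, no IO-waits
    s = 0
    for i in range( n ):
        s += i
    return s

N = 2_000_000

t0 = time.perf_counter()                # the same work, run sequentially
burn( N ); burn( N )
sequential_s = time.perf_counter() - t0

t0 = time.perf_counter()                # the same work, split over 2 threads
pair = [ threading.Thread( target = burn, args = ( N, ) ) for _ in range( 2 ) ]
for t in pair: t.start()
for t in pair: t.join()
threaded_s = time.perf_counter() - t0

# on a GIL-coordinated CPython, threaded_s will not be ~ sequential_s / 2;
# it stays about the same ( or worse, due to the GIL-handling overheads )
print( sequential_s, threaded_s )
```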
"Why does the following example of CPU-bound code run in parallel and never block?"
Well, everything still gets executed as a "pure"-[SERIAL] sequence of small amounts of time, during each of which the CPU works on one and only one Python thread, internally interrupted once each GIL-lock-holding quantum has been spent. The result only seems as if all the work were "quasi"-concurrently worked on; it is still a sequential execution of the actual work, super-sampled into small time-quanta-of-work and performed one after another till the work is finished.
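In CPython 3 the size of such a time-quantum is the interpreter's "switch interval", and it can be inspected or tuned at runtime ( a small sketch; 0.005 [s] is the CPython default ):

```python
import sys

print( sys.getswitchinterval() )   # default: 0.005 [s] per thread time-quantum
sys.setswitchinterval( 0.001 )     # ask for finer-grained ( and thus more
print( sys.getswitchinterval() )   # overhead-costly ) thread switching
```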
So, the Python threads actually pay a lot of overhead-costs ( reading, re-reading, at some point a POSACK'd acquiring and later a forced releasing of the Python in-software GIL-lock ), which costs you a deal of performance-overhead, while you receive nothing in exchange for all that many-threads overhead processing. Nothing but worse performance ( Q.E.D. above in @galaxyan's test results ).
You would have felt that on your own, if instead of calling a simple `fib( 32 )` you had evaluated something a bit more demanding, like:

```python
len( str( [ math.factorial( 2**f ) for f in range( 20 ) ][-1] ) )
```
( Btw. note that `fib()` cannot be the way to go here, as its recursive formulation will, on something like `fib( 10**N )`, start crashing right after your `N` grows over the limit of the Python interpreter's configuration threshold, set for the Python maximum recursion depth ... )
```python
import datetime

def fib( n ):                          # naive recursive Fibonacci worker
    return n if n < 2 else fib( n - 1 ) + fib( n - 2 )

def aCrashTestDemoWORKER( id, aMaxNUMBER ):
    MASKs = "INF: {2:} tid[{0:2d}]:: fib(2:{1:}) processing start..."
    MASKe = "INF: {2:} tid[{0:2d}]:: fib(2:{1:}) processing ended...{3:}"
    safeM = 10**max( 2, aMaxNUMBER )   # for aMaxNUMBER >= 3 the recursion
    pass;  print( MASKs.format( id, safeM, datetime.datetime.utcnow() ) )
    len( [ fib( someN ) for someN in range( safeM ) ] )
    pass;  print( MASKe.format( id, safeM, datetime.datetime.utcnow(), 20*"_" ) )
```
Q: Is the GIL ever unlocked for non-IO-bound work?

Yes, it can be. Some work can indeed be done GIL-free.
One way, harder to arrange, is to rather use `multiprocessing` with its sub-process-based backend. This avoids the GIL-locking, yet you pay quite a remarkable price: as many full copies of the whole Python-session state get allocated ( interpreter + all imported modules + all internal data-structures, whether needed for such distributed computations or not ), plus your ( now INTER-PROCESS ) communication performs serialisation / deserialisation before / after sending even a single bit of information there or back ( that is painful ). For the details on the actual "Economy"-of-costs, one may like to read the Amdahl's-law re-formulation that reflects the impacts of both these overheads and the atomic-processing durations.
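A minimal sketch of that `multiprocessing` route ( the `fib()` worker and the pool size are illustrative assumptions ): each sub-process runs its own interpreter, hence owns its own GIL, so CPU-bound work can finally run truly in parallel:

```python
from multiprocessing import Pool

def fib( n ):                           # naive, intentionally CPU-bound worker
    return n if n < 2 else fib( n - 1 ) + fib( n - 2 )

if __name__ == "__main__":
    with Pool( processes = 4 ) as pool: # 4 full python-session copies spawned
        # each argument and each result crosses the process boundary
        # via pickling - the serialisation overhead mentioned above
        print( pool.map( fib, [ 25, 25, 25, 25 ] ) )
        # -> [75025, 75025, 75025, 75025]
```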
Another case is using `numba.jit()`-compiled or pre-compiled code, where the smart `numba`-based LLVM compiler may get instructed in the decorator, via call signature(s) and other details, to work in a `nogil = True` mode, so as to generate code that need not use the ( expensive ) GIL-signalling, where appropriate to ask for such a comfort.
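A minimal sketch of that `numba` route ( hedged: it needs the `numba` package installed; the `cpu_heavy_sum()` function is an illustrative assumption, and a no-op fallback decorator is added so the snippet also runs where `numba` is absent, just without the GIL-free benefit ):

```python
try:
    from numba import jit               # the real LLVM-compiling decorator
except ImportError:
    def jit( **kwargs ):                # no-op fallback if numba is missing
        def decorate( f ):
            return f
        return decorate

@jit( nopython = True, nogil = True )   # nogil = True: the compiled machine-
def cpu_heavy_sum( n ):                 # code releases the GIL while running
    s = 0
    for i in range( n ):
        s += i * i
    return s

# several threads calling cpu_heavy_sum() may now run truly in parallel
print( cpu_heavy_sum( 10 ) )            # -> 285
```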
The last case is to move into a heterogeneous distributed-computing design, where Python remains the coordinator and the remote, distributed computing units are GIL-free number-crunchers, for which the Python-internal GIL-logic has no meaning and is by design ignored.
BONUS PART:

For more details on computing-intensive performance tricks, you may like this post on monitoring threads' overheads.