Q : ( Why ) ... runs slower than a normal for loop ( ? )
>>> import numpy as np; aData = np.random.rand( 50, 50, 50 )
>>> from skimage.restoration import denoise_tv_chambolle
>>> from zmq import Stopwatch; aClk = Stopwatch()
>>>
>>> aClk.start(); r = denoise_tv_chambolle( aData, weight = 0.1, multichannel = True ); b = aClk.stop(); print( "The code took {0:>9d}[us]".format( b ) )
The code took    679749[us]
The code took    683137[us]
The code took    678925[us]
The code took    688936[us]
Given the miniature data shape (50,50,50)-of-float64, the in-cache computing is The Key for performance. Using a joblib.Parallel with a 'threading' backend is rather an anti-pattern here ( Python uses the GIL-lock so as to re-[SERIAL]-ise the computing, one-step-after-another, as that avoids any kind of common, concurrency-related collision ). Such a serial flow of computing is even worse here, because the "switching" one-step-after-another comes at an additional cost ( not improving the original, purely-[SERIAL] code execution - so you pay more to receive the same result, only after a longer time ).
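A minimal sketch to verify this on one's own hardware, assuming scikit-image and joblib are installed ( the six repetitions and the time.perf_counter-based timing are illustrative choices, not the O/P's exact code; newer scikit-image versions replace multichannel with channel_axis ) :

>>> import numpy as np
>>> from time import perf_counter
>>> from joblib import Parallel, delayed
>>> from skimage.restoration import denoise_tv_chambolle
>>>
>>> aData = np.random.rand( 50, 50, 50 )
>>>
>>> aClk = perf_counter()                              # a plain, purely-[SERIAL] loop
>>> aRes = [ denoise_tv_chambolle( aData, weight = 0.1, multichannel = True ) for i in range( 6 ) ]
>>> print( "SERIAL    took {0:>9.0f}[us]".format( 1E6 * ( perf_counter() - aClk ) ) )
>>>
>>> aClk = perf_counter()                              # the GIL-stepped 'threading' backend
>>> aRes = Parallel( n_jobs = 6, backend = 'threading' )( delayed( denoise_tv_chambolle )( aData, weight = 0.1, multichannel = True ) for i in range( 6 ) )
>>> print( "threading took {0:>9.0f}[us]".format( 1E6 * ( perf_counter() - aClk ) ) )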
Q : increasing n_jobs also increases the processing time
Sure, it increases the amount of time wasted on the GIL-lock re-[SERIAL]-isation overheads, as there are more of the one-step-after-another, GIL-directed, collision-avoidance "switching"-transitions.
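A minimal probe ( continuing with the imports and aData from the sketch above ) to watch these add-on overheads grow as n_jobs increases :

>>> for nJOBS in ( 1, 2, 4, 6 ):                       # watch overheads grow with n_jobs
...     aClk = perf_counter()
...     aRes = Parallel( n_jobs = nJOBS, backend = 'threading' )( delayed( denoise_tv_chambolle )( aData, weight = 0.1, multichannel = True ) for i in range( 6 ) )
...     print( "n_jobs = {0:d} took {1:>9.0f}[us]".format( nJOBS, 1E6 * ( perf_counter() - aClk ) ) )
...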
Last but not least
Even if going into a fully fledged parallelism, using the process-based parallelism ( which avoids the costs of GIL-locking ), it comes, again, at a cost : a process-instantiation cost ( a full 1:1 memory-copy of the Python-interpreter process, n_jobs-times in Win O/S, similarly in Linux O/S - as documented in the joblib module, incl. recommendations to avoid some other forms of spawning parallel processes ), plus the parameter data-transfer costs and the result-transfer costs.
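A minimal sketch to expose just these add-on costs ( continuing with the imports and aData from the first sketch above; a_no_op_payload is a hypothetical probe function, not part of any library ) : sending the same array to process-based workers of joblib's default 'loky' backend, which do almost no work, so that ( nearly ) all of the measured time is the spawn- and transfer-overhead :

>>> def a_no_op_payload( x ):                          # a payload with ~ zero work,
...     return x.shape                                 #   so elapsed time ~ pure overheads
...
>>> aClk = perf_counter()
>>> aRes = Parallel( n_jobs = 6, backend = 'loky' )( delayed( a_no_op_payload )( aData ) for i in range( 6 ) )
>>> print( "spawn + transfer overheads took {0:>9.0f}[us]".format( 1E6 * ( perf_counter() - aClk ) ) )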
If one adds up all these add-on costs for n_jobs = 6, and if these costs were accrued in the name of just a miniature computing task ( as small as ~ 680 [ms] in duration ), one soon ends up paying way more to set up the parallel processing than one will ever receive back ( while other effects - like the worse-than-original cache re-use - will not "increase" the speed of computing either ).
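A back-of-envelope sketch of that accounting ( the overhead figure below is an ASSUMED, illustrative number, not a measurement ) : even a perfectly parallelisable payload cannot beat the original run once the add-on costs exceed the time they save :

>>> T_payload  = 680000                                # [us] the measured, purely-[SERIAL] run
>>> n_jobs     = 6
>>> T_overhead = 600000                                # [us] an ASSUMED sum of spawn + transfer costs
>>>
>>> T_parallel = T_overhead + T_payload / n_jobs       # best case : a perfect 1/n_jobs split
>>> print( "parallel ~ {0:9.0f}[us] vs serial ~ {1:9d}[us]".format( T_parallel, T_payload ) )
parallel ~    713333[us] vs serial ~    680000[us]

Once T_overhead exceeds T_payload * ( 1 - 1 / n_jobs ) ( here ~ 567 [ms] ), the parallel run is guaranteed to be slower - an overhead-aware re-formulation of Amdahl's Law in practice.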
The real-world costs of computing payloads ( and a due accounting for each class-of-(all-such)-costs ) are the reason ( Why ) ... runs slower.