
I have to apply a 2D filter to every slice of a stack of images and I would like to parallelize the analysis. However, the code below runs slower than a normal for loop. Also, increasing n_jobs increases the processing time: it is fastest for n_jobs = 1 and slowest for n_jobs = 6.

import numpy as np 
from joblib import Parallel, delayed
from skimage.restoration import denoise_tv_chambolle

arr = np.random.rand(50,50,50)

def f(arr):
    arr_h = denoise_tv_chambolle(arr, weight=0.1, multichannel=True)
    return arr_h

Parallel(n_jobs=6, backend="threading")(delayed(f)(i) for i in arr)
asked by Sav

1 Answer


Q : ( Why ) ... runs slower than a normal for loop ( ? )

>>> import numpy as np; _ = np.random.rand( 50, 50, 50 )
>>> from skimage.restoration import denoise_tv_chambolle
>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> 
>>> aClk.start(); r = denoise_tv_chambolle( _, weight = 0.1, multichannel = True ); b = aClk.stop(); print( "The code took {0: > 9d}[us]".format( b ) )
The code took    679749[us]
The code took    683137[us]
The code took    678925[us]
The code took    688936[us]
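( For readers without pyzmq installed: a minimal sketch of the same per-run timing using only the stdlib's time.perf_counter. np.sort is an illustrative stand-in workload here, since denoise_tv_chambolle itself requires scikit-image; the timed_runs helper is hypothetical, not part of any library. )

```python
import time
import numpy as np

arr = np.random.rand(50, 50, 50)

def timed_runs(fn, *args, repeats=4):
    # time fn(*args) `repeats` times; return per-run durations in [us]
    runs = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        runs.append(int((time.perf_counter() - t0) * 1e6))
        print("The code took {0: > 9d}[us]".format(runs[-1]))
    return runs

# np.sort stands in for denoise_tv_chambolle (which needs scikit-image)
runs = timed_runs(np.sort, arr)
```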

Given the miniature data shape ( (50, 50, 50) of float64 ), in-cache computing is The Key for performance. Using joblib.Parallel with the 'threading' backend is an anti-pattern here: Python's GIL-lock re-[SERIAL]-ises the computation, running one step after another precisely so as to avoid any kind of concurrency-related collision on shared state. Such a serial flow of computing is even worse than the plain loop, because the one-step-after-another "switching" between threads comes at an additional cost, without improving on the original purely-[SERIAL] code execution. In other words, you pay more to receive the same result, only after a longer time.
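The re-[SERIAL]-isation can be demonstrated with nothing but the stdlib ( a sketch: cpu_task is an illustrative pure-Python, CPU-bound stand-in for the denoiser, not the real workload ):

```python
import threading
import time

def cpu_task(n):
    # pure-Python, CPU-bound work: the GIL is held while this loop runs
    s = 0
    for i in range(n):
        s += i * i
    return s

N = 200_000

# plain serial loop
t0 = time.perf_counter()
for _ in range(4):
    cpu_task(N)
serial_s = time.perf_counter() - t0

# the "parallel" threaded version of the same four calls
t0 = time.perf_counter()
threads = [threading.Thread(target=cpu_task, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_s = time.perf_counter() - t0

# threads take about as long as the plain loop (often longer, due to switching)
print(f"serial   : {serial_s:.3f}[s]")
print(f"threaded : {threaded_s:.3f}[s]")
```

On a standard CPython build the threaded run does not finish in a quarter of the serial time, because only one thread ever computes at once.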

Q : ( Why does ) increasing n_jobs also increase the processing time ( ? )

Sure, it increases the amount of time wasted on the GIL-lock re-[SERIAL]-isation overheads: the more threads there are, the more one-step-after-another, GIL-directed, collision-avoidance "switching" transitions have to take place.
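The trend can be sketched without joblib at all, using the stdlib's concurrent.futures.ThreadPoolExecutor ( an illustrative stand-in: the work function is CPU-bound pure-Python, so the GIL serialises it and extra workers contribute only switching overhead, never extra throughput ):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(_):
    # CPU-bound pure-Python payload; the GIL admits only one such loop at a time
    s = 0
    for i in range(100_000):
        s += i * i
    return s

data = list(range(12))

for n_workers in (1, 2, 6):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = list(ex.map(work, data))
    dt = time.perf_counter() - t0
    print(f"n_workers = {n_workers}: {dt:.3f}[s]")
```

The wall-clock time does not drop as n_workers grows; on many machines it creeps upward, exactly as observed in the question.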


Last but not least

Even going into fully fledged parallelism, using process-based parallelism ( which avoids the costs of GIL-locking ), comes again at a cost: the process-instantiation cost ( on Windows a fresh Python interpreter is spawned for each of the n_jobs workers, and on Linux the default joblib backend similarly starts fresh worker processes rather than sharing the parent's state, as documented in the joblib module, incl. its recommendations on how to avoid some other, more expensive forms of spawning parallel processes ), plus the parameter data-transfer cost and the result data-transfer cost.

If one adds up all these add-on costs for n_jobs = 6, and sets them against just a miniature computing task ( as small as ~ 680 [ms] in duration ), one will soon end up paying far more to set up the parallel processing than one will ever receive back ( and secondary effects, such as worse-than-original cache re-use, will not "increase" the speed of computing either ).
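A back-of-the-envelope cost model makes this break-even arithmetic explicit ( the overhead figures below are illustrative assumptions, not measurements; only the ~ 0.68 [s] compute time comes from the Stopwatch runs above ):

```python
T_serial = 0.68   # [s] measured single-run compute time (from the Stopwatch runs)
setup    = 0.30   # [s] ASSUMED process-instantiation + import overhead
transfer = 0.05   # [s] ASSUMED per-worker parameter + result transfer cost

def t_parallel(n_jobs):
    # idealised model: a perfect work split plus the fixed add-on overheads
    return setup + T_serial / n_jobs + n_jobs * transfer

for n in (1, 2, 6):
    print(f"n_jobs = {n}: ~{t_parallel(n):.2f}[s]  (serial: {T_serial:.2f}[s])")
```

With these numbers every n_jobs setting costs more wall-clock time than the plain serial run; only a payload much larger than ~ 0.68 [s] could ever amortise the overheads.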

The real-world costs of computing payloads ( and a due accounting for each class of all such costs ) are the reason ( Why ) ... it runs slower.

user3666197
  • So, is there no way to make the process faster? I was trying np.memmap right now but it is still slow. – Sav Nov 04 '19 at 18:33
  • `np.memmap()`-s cost ~ 10 [ms] per random access, whereas in-cache data with smart re-use can be fetched in ~ 0.5 [ns] ... so with `np.memmap()`-s you make things many orders of magnitude worse – user3666197 Nov 04 '19 at 18:46