I have a workflow that I want to parallelize with dask, but I'm struggling to find an effective way to do so. I have profiled the code and identified the parts I want to speed up. Here it is in pseudocode:
for i in range(n):
    x = expensive_iris_extract_operation(i)
    r = cheaper_numpy_based_operation(x)
    save_to_disk(r)
The numpy-based operation doesn't need parallelizing; it accounts for less than 5% of the runtime on my local machine. What I want to speed up is the for loop, by running each iteration in a separate process (specifically, the expensive extract operation is where the majority of the time is spent). So my first attempt was to use a dask bag over range(n) and map
x = expensive_iris_extract_operation(i)
r = cheaper_numpy_based_operation(x)
save_to_disk(r)
onto it as a function of i.
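For concreteness, a minimal sketch of that first attempt, wrapping the body above in a function (process_one is a hypothetical name, and the three operations are placeholders for my real code):

import dask.bag as db

def process_one(i):
    # placeholder names standing in for the real operations
    x = expensive_iris_extract_operation(i)
    r = cheaper_numpy_based_operation(x)
    save_to_disk(r)

bag = db.from_sequence(range(n))
bag.map(process_one).compute(scheduler="processes")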
I sent it to my Slurm-based computing cluster to run on n processes. It successfully spawned n processes, but timed out on the numpy operations. Looking into it, I think this was because each process was running only one thread. When I run the code without dask, I get only one process but multiple threads, and the numpy-based operations are fast. I then tried the threaded scheduler instead: it ran, but with only one process, so there was no speedup from using the multi-processor cluster. (It was in fact slower than running without dask.)
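If the Slurm jobs are launched through dask-jobqueue (an assumption; the post doesn't say how the cluster is driven), one way to get multi-threaded worker processes is to request several cores per job but only one worker process, so numpy's internal threading has cores to use. A sketch, with the cores/memory values purely illustrative:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# One worker process per Slurm job, with 8 cores available to it,
# so numpy's threaded operations are not pinned to a single thread.
cluster = SLURMCluster(cores=8, processes=1, memory="16GB")
cluster.scale(jobs=n)  # n single-process, multi-threaded workers
client = Client(cluster)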
So I guess my issue is that I want to take advantage of multiprocessing, but doing so seems to disable the multithreading that numpy relies on, which I also need.
Is there any other way I can approach this? Thanks.
Edit: I tried using Python's native multiprocessing module instead, but ran into the same issue. If I map the iterables in the loop to processes in a process pool, the function grinds to a halt (well, not a halt, but it becomes very, very slow) when it hits the numpy operations (in this case an np.apply_along_axis call). It's faster to just run in serial.
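For reference, that attempt was along these lines (the operation names are the placeholders from above; my_func and n_workers are likewise hypothetical):

from multiprocessing import Pool
import numpy as np

def process_one(i):
    x = expensive_iris_extract_operation(i)  # placeholder
    r = np.apply_along_axis(my_func, 0, x)   # the numpy step that slowed to a crawl
    save_to_disk(r)                          # placeholder

with Pool(processes=n_workers) as pool:      # n_workers: hypothetical worker count
    pool.map(process_one, range(n))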
Edit 2: I solved this with a workaround: I refactored my numpy operations so that they happen serially, outside the loop (while my iris operation happens in parallel across multiple processes). This works, but it still seems odd to me that numpy's internal multithreading doesn't appear to work from inside a child process.
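A sketch of the workaround, under the same placeholder names: the extraction fans out across processes, and the numpy step runs serially in the parent afterwards.

from multiprocessing import Pool
import numpy as np

def extract_one(i):
    return expensive_iris_extract_operation(i)  # placeholder

with Pool(processes=n_workers) as pool:          # n_workers: hypothetical worker count
    extracts = pool.map(extract_one, range(n))   # parallel: the expensive iris step

for x in extracts:                               # serial: numpy's threads work normally here
    r = np.apply_along_axis(my_func, 0, x)       # my_func is a placeholder
    save_to_disk(r)

If the extracted arrays are large, pool.imap would stream results one at a time instead of holding them all in memory before the serial loop starts.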