Background
I have a list of paths to a thousand image stacks (3D numpy arrays) that were preprocessed and saved as .npy binaries.

Case Study
I would like to calculate the mean of all the images, and to speed up the analysis I thought I would parallelise the processing.

Approach using dask.delayed

# List with the file names
flist_img_to_filter

# I chunk the list of paths into sublists. The number of chunks corresponds to
# the number of cores used for the analysis
chunked_list
# Scatter the images sublists to be able to process in parallel
futures = client.scatter(chunked_list)

# Create dask processing graph
output = []
for future in futures:
    ImgMean = delayed(partial_image_mean)(future)
    output.append(ImgMean)

# Combine the partial means once, after the loop
ImgMean_all = delayed(sum)(output)
ImgMean_all = ImgMean_all / len(futures)

# Compute the graph
ImgMean = ImgMean_all.compute()

Approach using dask.array, modified from Matthew Rocklin's blog

imread = delayed(np.load, pure=True)  # Lazy version of np.load
# Lazily evaluate imread on each path
lazy_values = [imread(img_path) for img_path in flist_img_to_filter]

# Wrap each lazy value in a single-chunk Dask array
arrays = [da.from_delayed(lazy_value, dtype=np.uint16, shape=shape)
          for lazy_value in lazy_values]

# Stack all small Dask arrays into one
stack = da.stack(arrays, axis=0)

ImgMean = stack.mean(axis=0).compute()               

Questions

1. In the dask.delayed approach is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?
2. The dask.arrays approach is significantly slower and with higher memory usage. Is this a 'bad way' to use dask.arrays?
3. Is there a better way to approach the issue?

Thanks!

s1mc0d3

1 Answer

In the dask.delayed approach is it necessary to pre-chunk the list? If I scatter the original list I obtain a future for each element. Is there a way to tell a worker to process the futures it has access to?

The simple answer is no: as of Dask version 0.15.4 there is no very robust way to submit a computation on "all of the tasks of a certain type currently present on this worker".

However, you can easily ask the scheduler which keys are present on which workers using the who_has or has_what client methods.

import dask
from dask.distributed import wait

futures = dask.persist(futures)
wait(futures)
client.who_has(futures)
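
As a rough sketch of how that information could be used, the mapping returned by who_has can be inverted so that each worker address points to the futures it currently holds. This assumes `futures` is a list of distributed futures whose data is already in worker memory; the helper name `group_futures_by_worker` is hypothetical and not part of Dask.

```python
from collections import defaultdict


def group_futures_by_worker(client, futures):
    """Return {worker_address: [futures whose data that worker holds]}.

    Hypothetical helper, not a Dask API. It only reports where data is
    at the moment of the call; the scheduler may move or copy data later.
    """
    by_key = {future.key: future for future in futures}
    grouped = defaultdict(list)
    # who_has maps each key to the workers holding a copy of it
    for key, workers in client.who_has(futures).items():
        for worker in workers:
            grouped[worker].append(by_key[key])
    return dict(grouped)
```

Each group could then be handed to client.submit together with the workers= keyword to compute a partial mean where the data already lives. This is a workaround rather than a robust mechanism, since nothing stops the scheduler from relocating data in between.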

The dask.arrays approach is significantly slower and with higher memory usage. Is this a 'bad way' to use dask.arrays?

You might want to play with the split_every= keyword of the mean function, or else rechunk your array to group images together (probably similar to what you do above) before calling mean, to explore the parallelism/memory tradeoffs.
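
A minimal sketch of both knobs, assuming `stack` is a Dask array with one image per chunk along axis 0, as in the question. The function name `grouped_mean` and the default values for `images_per_chunk` and `split_every` are illustrative assumptions to tune, not recommendations.

```python
import dask.array as da
import numpy as np


def grouped_mean(stack, images_per_chunk=8, split_every=4):
    """Mean over axis 0 of a stacked Dask array (illustrative helper)."""
    # Group several images into each chunk: fewer, larger tasks
    grouped = stack.rechunk({0: images_per_chunk})
    # split_every limits how many chunks are combined per reduction step
    return grouped.mean(axis=0, split_every=split_every)
```

For example, `grouped_mean(stack, images_per_chunk=16).compute()` would load sixteen images per task before reducing. Larger chunks mean less scheduling overhead but more memory per task, so the best setting depends on image size and available RAM.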

Is there a better way to approach the issue?

You might also try as_completed and compute running means as data completes. You would have to switch from delayed to futures for this.
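
A sketch of that idea using futures, assuming `client` is a connected dask.distributed Client and `paths` is a list of .npy files that all hold arrays of the same shape. The function name `running_mean` is hypothetical.

```python
import numpy as np
from dask.distributed import as_completed


def running_mean(client, paths):
    """Accumulate a mean as each image finishes loading (illustrative)."""
    if not paths:
        raise ValueError("paths must not be empty")
    # One future per file; loading happens on the workers
    futures = client.map(np.load, paths)
    total = None
    count = 0
    # as_completed yields futures in the order they finish
    for future in as_completed(futures):
        # Cast to float64 so summing uint16 images cannot overflow
        img = future.result().astype(np.float64)
        total = img if total is None else total + img
        count += 1
    return total / count
```

Only the accumulator and the most recently fetched image need to be held on the client at once, at the cost of transferring every image from the workers to the client.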

MRocklin