How to use Dask to parallelize object detection on a massive image on the cluster

Question

I am trying to see whether I can use Dask for blockwise parallelization of the detection and segmentation of objects in massive 2D images (~20-50 GB) on a cluster.

My logic to detect/segment objects in an image block will be encapsulated in a function.

I came across a Dask function called map_blocks that lets me apply a custom function to each block/chunk of a dask array.

However, I see that the function I pass to map_blocks must also return an array.

For object detection/segmentation, I want my function to return the coordinates of the bounding contour of each object detected in the block. Note that the number of objects in any block is unknown and depends on the image.

How can I solve this use case with map_blocks or something else in Dask?

score 1 · Answer 1 · answered Nov 22 '16 at 16:18

For more custom computations, I recommend using dask.delayed, which lets you parallelize fairly generic Python code.

If you have a dask.array, you can turn it into a bunch of delayed objects with the .to_delayed() method:

blocks = x.to_delayed()

You can then run arbitrary functions on these blocks however you like.

import dask

@dask.delayed
def process_block(block):
    ...

# one delayed task per block, keeping the 2D block layout of x
blocks = [[process_block(block) for block in row]
          for row in x.to_delayed().tolist()]
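A delayed function can return any Python object, so a list of coordinates per block works. A minimal sketch of how the results could be reported in global image coordinates, assuming each task is handed the offset of its block: the threshold "detector", the extra offset parameters, and the helper block_starts are hypothetical stand-ins, not part of Dask.

```python
import numpy as np
import dask
import dask.array as da


@dask.delayed
def process_block(block, row_offset, col_offset):
    # Hypothetical detector: treat every pixel above 0.5 as one "object".
    rows, cols = np.nonzero(block > 0.5)
    # Shift block-local indices by the block's offset to get global indices.
    return [(int(r) + row_offset, int(c) + col_offset)
            for r, c in zip(rows, cols)]


def block_starts(chunks):
    # Start index of each chunk along one axis, e.g. (2, 2) -> [0, 2].
    return np.cumsum((0,) + tuple(chunks[:-1])).tolist()


# Small stand-in for the large image, with two bright pixels.
image = np.zeros((4, 4))
image[0, 1] = 1.0
image[3, 2] = 1.0
x = da.from_array(image, chunks=(2, 2))

row_starts = block_starts(x.chunks[0])
col_starts = block_starts(x.chunks[1])
blocks = x.to_delayed()  # 2D numpy array of Delayed objects, one per block

tasks = [process_block(blocks[i, j], row_starts[i], col_starts[j])
         for i in range(blocks.shape[0])
         for j in range(blocks.shape[1])]

# Each task returns a list of any length; flatten them into one list.
results = dask.compute(*tasks)
detections = [point for block_result in results for point in block_result]
```

Objects that straddle a block boundary are not handled here; that would need overlapping blocks or a merge step afterwards.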

MRocklin

Thanks for the answer. In the solution you suggest, is process_block free to return any type of object? And is it possible for the process_block function to know the block_id? – cdeepakroy Nov 22 '16 at 19:44
block_id would be needed so that my object detection function can generate the bounding contour coordinates of the detected objects in global space instead of block space. – cdeepakroy Nov 22 '16 at 19:46
@MRocklin can you elaborate on this one as well? https://stackoverflow.com/questions/56586748/generating-batches-of-images-in-dask – enterML Jun 14 '19 at 10:45

score 1 · Answer 2 · answered Nov 25 '16 at 14:05

You could use an object array as output, with a chunk shape of (1,1). Be sure to add dtype='object' to your map_blocks call. Inside the mapped function, you then instantiate a (1,1)-sized object array and store the list of coordinates at (0,0). Like this:

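import numpy as np
import dask.array as da

def find_objects(block):
    # do logic on `block`; empty placeholder, replace with the real detections
    coordinate_list = []
    result = np.empty((1,1), dtype='object')
    result[0,0] = coordinate_list
    return result

# chunks=(1,1) tells dask that every output block has shape (1,1)
da_coords = da.map_blocks(find_objects, da_image, dtype='object', chunks=(1,1))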
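If the coordinates need to be in global image space, map_blocks can pass the block's position to the mapped function through the special block_info keyword argument. A rough sketch of that idea, assuming a hypothetical threshold "detector" in place of the real logic and a tiny in-memory image standing in for da_image:

```python
import numpy as np
import dask.array as da


def find_objects(block, block_info=None):
    # block_info[0] describes the first input array; "array-location" holds
    # the (start, stop) range of this block along each axis.
    if block_info is None:
        row_start, col_start = 0, 0
    else:
        (row_start, _), (col_start, _) = block_info[0]["array-location"]

    # Hypothetical detector: treat every pixel above 0.5 as one "object".
    rows, cols = np.nonzero(block > 0.5)
    coordinate_list = [(int(r) + int(row_start), int(c) + int(col_start))
                       for r, c in zip(rows, cols)]

    # Wrap the variable-length list in a (1, 1) object array.
    result = np.empty((1, 1), dtype=object)
    result[0, 0] = coordinate_list
    return result


# Small stand-in for the large image, with two bright pixels.
image = np.zeros((4, 4))
image[0, 1] = 1.0
image[3, 2] = 1.0
da_image = da.from_array(image, chunks=(2, 2))

da_coords = da.map_blocks(find_objects, da_image, dtype=object, chunks=(1, 1))

# One cell per input block; each cell holds that block's list of coordinates.
coords = da_coords.compute()
all_points = [point for cell in coords.ravel() for point in cell]
```

The computed array has one cell per input block, so the block layout is preserved and empty blocks simply hold an empty list.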

Thanks for the suggestion. Will try this and get back here. – cdeepakroy Dec 01 '16 at 20:40
