
Introduction

I have an image stack (ImgStack) made of 42 planes, each 2048x2048 px, and a function that I use for the analysis:

def All(ImgStack):
    # some filtering
    # more filtering
    return ImgStack

I determined that the most efficient way to process the array with dask (on my computer) is to use chunks=(21, 256, 256).
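For reference, here is the minimal setup the snippets below assume; the random stack is just a stand-in for my real data. (I am on an older dask where the ghosting API lives in da.ghost; newer releases renamed it to da.overlap.)

import time
import numpy as np
import dask.array as da
import h5py as h5

# Synthetic stand-in for the real stack: 42 planes of 2048x2048 px.
ImgStack = np.random.random((42, 2048, 2048)).astype(np.float32)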

When I run map_blocks:

now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
g = da.ghost.ghost(z, depth={0: 10, 1: 50, 2: 50},
                   boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
g2 = g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50, 2: 50})
print('Time=', str(time.time() - now))

Time= 1.7090258598327637

Instead, when I run map_overlap:

now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
y = z.map_overlap(All, depth={0: 10, 1: 50, 2: 50},
                  boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
y.compute()
print('Time=', str(time.time() - now))

Time= 228.19104409217834

I guess the big time difference is due to the conversion from dask.array to np.array in map_overlap, because if I add the conversion step to the map_blocks script the execution time becomes comparable:

now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
g = da.ghost.ghost(z, depth={0: 10, 1: 50, 2: 50},
                   boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
g2 = g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50, 2: 50})
I = np.array(result)
print('Time=', str(time.time() - now))

Time= 209.68917989730835
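A toy illustration of the same effect (a smaller random array and a trivial squaring in place of All, just to show where the time actually goes):

import time
import numpy as np
import dask.array as da

x = da.from_array(np.random.random((42, 512, 512)), chunks=(21, 128, 128))

t0 = time.time()
lazy = x.map_blocks(lambda b: b ** 2)   # builds the task graph only
print('graph build:', time.time() - t0)

t0 = time.time()
out = np.array(lazy)                    # materializing triggers the computation
print('compute:', time.time() - t0)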

Issue

So the best option would be to keep the dask.array, but the problem shows up when I save the data to an HDF5 file:

now = time.time()
result.to_hdf5('/Users/simone/Downloads/test.h5', '/Dask2', compression='lzf')
print('Time=', str(time.time() - now))

Time= 243.1597340106964
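As far as I understand, to_hdf5 is what forces the whole (so far lazy) computation: it is roughly equivalent to creating the dataset and streaming the chunks into it with da.store. A sketch of that equivalence (test_store.h5 is just a hypothetical scratch file):

with h5.File('/Users/simone/Downloads/test_store.h5', 'w') as f:
    d = f.create_dataset('/Dask2', shape=result.shape, dtype=result.dtype,
                         chunks=True, compression='lzf')
    da.store(result, d)   # computes each chunk and writes it into the dataset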

but if I save the corresponding np.array:

test = h5.File('/Users/simone/Downloads/test.h5', 'r+')
DT = test.require_group('NP')
DT.create_dataset('t', data=I, dtype=I.dtype, compression='lzf')
now = time.time()
print('Time=', str(time.time() - now))

Time= 4.887580871582031e-05

Question

So I would like to run the filtering and save the arrays in as little time as possible. Is there a way to speed up the conversion from dask.array to np.array, or to speed up da.to_hdf5?
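To make it concrete, the pattern I am after is to pay the computation cost once and then write the in-memory result (the dataset name t2 is just a placeholder):

arr = result.compute()   # the expensive filtering happens here, once

with h5.File('/Users/simone/Downloads/test.h5', 'a') as f:
    f.create_dataset('/NP/t2', data=arr, compression='lzf')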

Thanks! Any comments would be appreciated.

s1mc0d3

1 Answer


In your fast examples you never actually compute the result. Dask arrays are lazy: from_array, ghost, and map_blocks only build a task graph, so the ~1.7 seconds in your first example is just graph-construction time; the filtering itself runs only when you call compute(), convert to a NumPy array, or write to disk. (Note also that in your h5py snippet the timer starts only after create_dataset has already run, so it measures essentially nothing.) To me it looks like your computation genuinely takes 200 seconds or so.

If you wanted to better understand where the time is going, you could try the dask profiler.
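A minimal sketch with the diagnostics that ship with dask (this assumes the default threaded scheduler and that bokeh is installed for the plot; y is the map_overlap result from the question):

from dask.diagnostics import Profiler, ResourceProfiler, visualize

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    y.compute()

# Renders a bokeh timeline of the tasks plus CPU/memory usage.
visualize([prof, rprof])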

MRocklin