Introduction
I have an image stack (ImgStack) made of 42 planes, each 2048x2048 px, and a function that I use for the analysis:
def All(ImgStack):
    # some filtering
    # more filtering
    ...
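(The real body is omitted; as a rough stand-in, a hypothetical version that keeps the stack shape unchanged could look like the following, where the gaussian/median filters are only placeholders and not my actual pipeline.)

from scipy import ndimage

def All(ImgStack):
    # placeholder steps: some smoothing, then more filtering,
    # returning an array with the same (planes, y, x) shape
    filtered = ndimage.gaussian_filter(ImgStack, sigma=1)
    filtered = ndimage.median_filter(filtered, size=3)
    return filtered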
I determined that the most efficient way to process the array with dask (on my computer) is to use chunks=(21, 256, 256).
When I run map_blocks:
import time
import numpy as np
import dask.array as da

now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
g = da.ghost.ghost(z, depth={0: 10, 1: 50, 2: 50},
                   boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
g2 = g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50, 2: 50})
print('Time=', str(time.time() - now))
Time= 1.7090258598327637
Instead, when I run map_overlap:
now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
y = z.map_overlap(All, depth={0: 10, 1: 50, 2: 50},
                  boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
y.compute()
print('Time=', str(time.time() - now))
Time= 228.19104409217834
I guess the big time difference is due to the conversion from dask.array to np.array in map_overlap, because if I add the conversion step to the map_blocks script the execution time becomes comparable:
now = time.time()
z = da.from_array(ImgStack, chunks=(21, 256, 256))
g = da.ghost.ghost(z, depth={0: 10, 1: 50, 2: 50},
                   boundary={0: 'periodic', 1: 'periodic', 2: 'periodic'})
g2 = g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50, 2: 50})
I = np.array(result)
print('Time=', str(time.time() - now))
Time= 209.68917989730835
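As far as I understand, np.array(result) and result.compute() trigger the same work, so the conversion is effectively where the deferred filtering gets executed. A minimal check, reusing the same result array as above:

# both calls materialize the lazy dask graph into an in-memory numpy array
I1 = np.array(result)     # conversion via the array protocol
I2 = result.compute()     # explicit evaluation
assert isinstance(I1, np.ndarray) and isinstance(I2, np.ndarray)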
Issue
So the best option would be to keep the dask.array, but the problem shows up when I save the data to an HDF5 file:
now = time.time()
result.to_hdf5('/Users/simone/Downloads/test.h5', '/Dask2', compression='lzf')
print('Time=', str(time.time() - now))
Time= 243.1597340106964
But if I save the corresponding np.array instead:
import h5py as h5

test = h5.File('/Users/simone/Downloads/test.h5', 'r+')
DT = test.require_group('NP')
DT.create_dataset('t', data=I, dtype=I.dtype, compression='lzf')
now = time.time()
print('Time=', str(time.time() - now))
Time= 4.887580871582031e-05
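For reference, here is a rough sketch that would separate the computation time from the pure HDF5 write time, so the two numbers can be compared directly (the file name test_split.h5 and the dataset path are just placeholders, and result is the same array as above):

now = time.time()
arr = result.compute()  # force the filtering graph to run
print('compute Time=', str(time.time() - now))

now = time.time()
with h5.File('/Users/simone/Downloads/test_split.h5', 'w') as f:  # placeholder output file
    # pure write of the already-computed numpy array
    f.create_dataset('/NP/t', data=arr, compression='lzf')
print('write Time=', str(time.time() - now))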
Question
So I would like to run the filtering and save the arrays in the least amount of time possible. Is there a way to speed up the conversion from dask.array to np.array, or to speed up da.to_hdf5?
Thanks! Any comments would be appreciated.