Using the example on http://dask.pydata.org/en/latest/array-creation.html
from glob import glob
import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))
dsets = [h5py.File(fn, 'r')['/data'] for fn in filenames]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # concatenate arrays along the first axis
I'm having trouble understanding the last line: is what gets returned a dask array of "dask arrays", or a "normal" NumPy array that points to as many dask arrays as there were datasets across all the HDF5 files?
Also, is there any performance benefit (threading- or memory-wise) during the file-read stage as a result of `da.from_array`, or should improvements only be expected once you concatenate into the dask array `x`?
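For reference, here is a minimal sketch of what I used to inspect the result, with in-memory NumPy arrays as hypothetical stand-ins for the HDF5 datasets (the shapes are made up):

```python
import numpy as np
import dask.array as da

# Stand-ins for the HDF5 datasets (hypothetical shapes)
dsets = [np.ones((2000, 1000)) for _ in range(3)]

# Same pattern as above: wrap each dataset, then concatenate
arrays = [da.from_array(d, chunks=(1000, 1000)) for d in dsets]
x = da.concatenate(arrays, axis=0)

print(type(x))   # a single dask array, not a container of dask arrays
print(x.shape)   # (6000, 1000)

# No data is read or summed until .compute() is called
total = x.sum().compute()
print(total)
```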