
Using the example on http://dask.pydata.org/en/latest/array-creation.html

from glob import glob
import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))
dsets = [h5py.File(fn, mode='r')['/data'] for fn in filenames]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # Concatenate arrays along first axis

I'm having trouble understanding the last line: is what gets returned a dask array of "dask arrays", or a "normal" NumPy array that points to as many dask arrays as there were datasets across all the HDF5 files?

Is there any increase in performance (thread- or memory-based) during the file-read stage as a result of da.from_array, or is it only when you concatenate into the dask array x that you should expect improvements?

mobcdi

1 Answer


The objects in the arrays list are all dask arrays, one for each file.

The x object is also a dask array that combines all of the results of the dask arrays in the arrays list. It isn't a dask.array of dask arrays; it's just a single flattened dask array with a larger first dimension.
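A small in-memory sketch makes this concrete (the shapes and chunk sizes here are made up for illustration, standing in for the HDF5 datasets):

```python
import numpy as np
import dask.array as da

# Three small NumPy arrays standing in for the per-file datasets
arrays = [da.from_array(np.ones((4, 5)), chunks=(2, 5)) for _ in range(3)]
x = da.concatenate(arrays, axis=0)

print(type(x))     # a single dask array, not an array of arrays
print(x.shape)     # (12, 5): the first dimension grew, the rest unchanged
print(x.sum().compute())  # 60.0
```

Indexing or reducing x dispatches to the right underlying chunk automatically; you never deal with the per-file arrays again.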

There will probably not be an increase in performance for reading data. You're likely to be I/O bound by your disk bandwidth. Most people in this situation are using dask.array because they have more data than can conveniently fit into RAM. If this isn't valuable to you then I would stick with NumPy.
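The reason da.from_array itself costs nothing up front is that dask is lazy: building an expression just records a task graph, and the chunks are only read (in parallel, within memory limits) when you call compute(). A minimal sketch, again using an in-memory array in place of the HDF5 data:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(1_000_000).reshape(1000, 1000),
                  chunks=(250, 1000))

# Building the expression is instant: no chunk has been touched yet
y = (x ** 2).mean(axis=0)
print(y)  # a lazy dask array describing the computation

# Data only flows, chunk by chunk, when a concrete result is requested
result = y.compute()
print(result.shape)  # (1000,)
```

With real HDF5-backed datasets the same pattern lets you reduce data much larger than RAM, since only a few chunks need to be resident at any time.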

MRocklin