I need to read in and operate independently over many chunks of a large dataframe/NumPy array. However, these chunks are chosen in a specific, non-uniform manner and fall naturally into groups within an HDF5 file. Each group is small enough to fit into memory (though even without that restriction, I suppose the standard chunking procedure would suffice).
Specifically, instead of
import h5py
import dask.array as da
f = h5py.File('myfile.hdf5', 'r')
x = da.from_array(f['/data'], chunks=(1000, 1000))
I want something closer to (pseudocode):
f = h5py.File('myfile.hdf5', 'r')
x = da.from_array(f, chunks=(f['/data1'], f['/data2'], ...))
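Something like the following is roughly what I am imagining (only a sketch; the dataset names are made up, and I am assuming each group holds a single dataset of compatible shape), where each group becomes its own dask array with one chunk and the arrays are then stitched together:
import h5py
import dask.array as da

f = h5py.File('myfile.hdf5', 'r')

# Made-up dataset names -- replace with the groups/datasets in your file.
names = ['/data1', '/data2', '/data3']

# Wrap each HDF5 dataset as its own dask array, with one chunk per group
# (each group fits in memory), then concatenate along the first axis.
arrays = [da.from_array(f[name], chunks=f[name].shape) for name in names]
x = da.concatenate(arrays, axis=0)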
I believe http://dask.pydata.org/en/latest/delayed-collections.html hints that this is possible, but I am still reading up on and trying to understand dask/HDF5.
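My reading of that page suggests something along these lines (again just a sketch with made-up dataset names, assuming each group can be read whole as a single chunk):
import h5py
import dask
import dask.array as da

f = h5py.File('myfile.hdf5', 'r')
names = ['/data1', '/data2']  # made-up dataset names

def read_group(name):
    # Read one whole group/dataset into memory as a NumPy array.
    return f[name][...]

# Each read is deferred until compute time, one task per group.
delayed_reads = [dask.delayed(read_group)(name) for name in names]
arrays = [da.from_delayed(d, shape=f[name].shape, dtype=f[name].dtype)
          for d, name in zip(delayed_reads, names)]
x = da.concatenate(arrays, axis=0)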
My previous implementation uses a number of CSV files and reads them in as needed with its own multiprocessing logic; I would like to collapse all of this functionality into dask with HDF5.
Is chunking by HDF5 group/read possible, and is my line of thought OK?