
I must read in and operate independently over many chunks of a large dataframe/numpy array. However, these chunks are chosen in a specific, non-uniform manner and are broken naturally into groups within an HDF5 file. Each group is small enough to fit into memory (though even without that restriction, I suppose the standard chunking procedure would suffice).
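
For concreteness, a minimal sketch of the kind of layout I mean (the dataset names and shapes here are made up):

 import h5py
 import numpy as np

 # Hypothetical layout: one dataset per natural group, each small enough
 # to fit into memory on its own, with differing numbers of rows.
 with h5py.File('myfile.hdf5', 'w') as f:
     f.create_dataset('/data1', data=np.random.random((500, 1000)))
     f.create_dataset('/data2', data=np.random.random((800, 1000)))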

Specifically, instead of

 import h5py
 import dask.array as da

 f = h5py.File('myfile.hdf5')
 x = da.from_array(f['/data'], chunks=(1000, 1000))

I want something closer to (pseudocode):

 f = h5py.File('myfile.hdf5')
 x = da.from_array(f, chunks=(f['/data1'], f['/data2'], ...,))

I believe http://dask.pydata.org/en/latest/delayed-collections.html hints that this is possible, but I am still reading up on dask and HDF5.

My previous implementation uses a number of CSV files and reads them in as needed with its own multiprocessing logic. I would like to collapse all of this functionality into dask with HDF5.

Is chunking by HDF5 group on read possible, and is my line of thought OK?

Eric Kaschalk

1 Answer


I would read each group in as its own single-chunk dask.array and then concatenate or stack those arrays.

Read many dask.arrays

import h5py
import dask.array as da

f = h5py.File(...)
dsets = [f[dset] for dset in datasets]  # `datasets` is a list of dataset names in the file
arrays = [da.from_array(dset, chunks=dset.shape) for dset in dsets]

Alternatively, use a lock to defend HDF5

HDF5 is not threadsafe, so let's use a lock to defend it from parallel reads. I haven't actually checked whether this is necessary when reading across different groups.

from threading import Lock
lock = Lock()

arrays = [da.from_array(dset, chunks=dset.shape, lock=lock) 
           for dset in dsets]

Stack or Concatenate arrays together

array = da.concatenate(arrays, axis=0)
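
If every group happens to have exactly the same shape, stacking along a new axis is the other option; a sketch, valid only when the dsets are identically shaped:

array = da.stack(arrays, axis=0)  # shape becomes (len(arrays),) + dset.shape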

See http://dask.pydata.org/en/latest/array-stack.html

Or use dask.delayed

You could also, as you suggest, use dask.delayed to do the first step of reading in the single-chunk dask.arrays.
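
A minimal sketch of that route, assuming a hypothetical load helper and made-up dataset names (da.from_delayed needs the shape and dtype up front because it cannot inspect the delayed result):

from dask import delayed
import dask.array as da
import h5py

def load(name):
    # Open the file inside the task so each delayed read stands alone
    with h5py.File('myfile.hdf5', 'r') as f:
        return f[name][...]

names = ['/data1', '/data2']  # hypothetical dataset names
with h5py.File('myfile.hdf5', 'r') as f:
    meta = [(f[n].shape, f[n].dtype) for n in names]

arrays = [da.from_delayed(delayed(load)(n), shape=s, dtype=d)
          for n, (s, d) in zip(names, meta)]
array = da.concatenate(arrays, axis=0)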

MRocklin
  • The `chunks=dset.shape` was what I was missing conceptually. – Eric Kaschalk Oct 12 '16 at 20:05
  • A followup: I assume that `f[dset]` is lazy? Also, does this scale naturally should each group require chunking itself? – Eric Kaschalk Oct 12 '16 at 20:06
  • Yes, `f[dset]` is lazy-ish. Yes, this does scale naturally to chunking per group. – MRocklin Oct 12 '16 at 20:15
  • Last question: Are the datasets read, not just operated on, in parallel? I ask because I am unsure whether having each dset tied to the same descriptor will cause issues. – Eric Kaschalk Oct 13 '16 at 14:40
  • You can provide a lock to `da.from_array` if you'd like to protect your HDF5 file. `from threading import Lock; lock = Lock(); arrays = [da.from_array(..., lock=lock) for ...]` – MRocklin Oct 13 '16 at 15:54