
From the h5py docs, I see that I can cast an HDF5 dataset to another type using the dataset's astype method. This returns a context manager which performs the conversion on-the-fly.

However, I would like to read in a dataset stored as uint16 and then cast it to the float32 type. Thereafter, I would like to extract various slices from this dataset in a different function, as the cast type float32. The docs explain the usage as:

with dataset.astype('float32'):
    castdata = dataset[:]

This would cause the entire dataset to be read in and converted to float32, which is not what I want. I would like to have a reference to the dataset, but cast as float32, equivalent to numpy's astype. How do I create a reference to the .astype('float32') object so that I can pass it to another function for use?

An example:

import h5py as HDF
import numpy as np
intdata = (100*np.random.random(10)).astype('uint16')

# create the HDF dataset
def get_dataset_as_float():
    hf = HDF.File('data.h5', 'w')
    d = hf.create_dataset('data', data=intdata)
    print(d.dtype)
    # uint16

    with d.astype('float32'):
        # This won't work since the context expires. Returns a uint16 dataset reference
        return d

    # this works but causes the entire dataset to be read & converted
    # with d.astype('float32'):
    #     return d[:]

Furthermore, it seems that the astype context only applies when the data elements are actually accessed. This means that:

def use_data():
    d = get_dataset_as_float()
    # this is a uint16 dataset

    # try to use it as a float32
    with d.astype('float32'):
        print(np.max(d))     # --> output is uint16
        print(np.max(d[:]))  # --> output is float32, but the entire dataset is loaded

So is there not a numpy-esque way of using astype?

achennu
  • I don't think that `np.max(d)` is doing anything particularly clever here. Since `d` does not have its own `.max()` method, `np.max()` will read the array into memory and call `np.core.umath.maximum.reduce()` on it, using `d.dtype` to set the output type. The timings for `np.max(d)` and `np.max(d[:])` are near-identical. – ali_m Aug 11 '14 at 13:43
  • @ali_m You may be right. I just chose np.max as a way to check whether an operation on the array returned the cast dtype. It's not important to my calculations; I will mostly extract slices that I work with. – achennu Aug 13 '14 at 06:44

2 Answers


d.astype() returns an AstypeContext object. If you look at the source for AstypeContext you'll get a better idea of what's going on:

class AstypeContext(object):

    def __init__(self, dset, dtype):
        self._dset = dset
        self._dtype = numpy.dtype(dtype)

    def __enter__(self):
        self._dset._local.astype = self._dtype

    def __exit__(self, *args):
        self._dset._local.astype = None

When you enter the AstypeContext, the ._local.astype attribute of your dataset gets updated to the new desired type, and when you exit the context it gets changed back to its original value.
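As a quick illustration (a sketch using the same uint16 setup as in the question; the file name is arbitrary, and it relies on the internal _local.astype attribute shown above, so it applies to the h5py version being discussed here):

import h5py as HDF
import numpy as np

# the context just toggles _local.astype, and only the elements you
# actually read get converted on the way out
intdata = (100 * np.random.random(10)).astype('uint16')
with HDF.File('demo.h5', 'w') as hf:
    d = hf.create_dataset('data', data=intdata)
    with d.astype('float32'):
        print(d._local.astype)   # float32 -- reads now convert on the fly
        print(d[2:5].dtype)      # float32, and only this slice is read from disk
    print(d._local.astype)       # None -- back to plain uint16 reads
    print(d[2:5].dtype)          # uint16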

You can therefore get more or less the behaviour you're looking for like this:

def get_dataset_as_type(d, dtype='float32'):

    # creates a new Dataset instance that points to the same HDF5 identifier
    d_new = HDF.Dataset(d.id)

    # set the ._local.astype attribute to the desired output type
    d_new._local.astype = np.dtype(dtype)

    return d_new

When you now read from d_new, you will get float32 numpy arrays back rather than uint16:

d = hf.create_dataset('data', data=intdata)
d_new = get_dataset_as_type(d, dtype='float32')

print(d[:])
# array([81, 65, 33, 22, 67, 57, 94, 63, 89, 68], dtype=uint16)
print(d_new[:])
# array([ 81.,  65.,  33.,  22.,  67.,  57.,  94.,  63.,  89.,  68.], dtype=float32)

print(d.dtype, d_new.dtype)
# uint16, uint16

Note that this doesn't update the .dtype attribute of d_new (which seems to be immutable). If you also wanted to change the dtype attribute, you'd probably need to subclass h5py.Dataset in order to do so.
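If you did want to go down that route, a rough, untested sketch might look something like the following. It re-uses the same _local.astype trick and simply overrides the dtype property; since h5py also uses .dtype internally, this could have side effects and should be treated as an illustration only:

import h5py as HDF
import numpy as np

class CastDataset(HDF.Dataset):
    """Untested sketch: a Dataset that reads back as a different dtype."""

    def __init__(self, dset_id, dtype):
        self._cast_dtype = np.dtype(dtype)   # set first, in case .dtype is touched during init
        # wrap the same HDF5 identifier, as in get_dataset_as_type above
        super(CastDataset, self).__init__(dset_id)
        self._local.astype = self._cast_dtype

    @property
    def dtype(self):
        # report the cast type instead of the on-disk type (d.id.dtype)
        return self._cast_dtype

# hypothetical usage:
# d_new = CastDataset(d.id, 'float32')
# d_new.dtype   # dtype('float32')
# d_new[:]      # float32 array, converted on read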

ali_m
  • Interesting. I did look into the AsTypeContext, but was not sure if setting the dtype myself would have some undesirable consequences. Will do some testing and come back to this answer. Thank you. – achennu Aug 13 '14 at 06:45
  • @arjmage I think it should be fine. `d._local` is a [`threading.local`](https://docs.python.org/2/library/threading.html#threading.local) object, so your changes ought to be thread-safe. You can see [here](https://github.com/h5py/h5py/blob/master/h5py/_hl/dataset.py#L394) that `d._local.astype` is just used to set the `dtype` of the output numpy array that the data gets read out into. [`d.dtype` actually points to `d.id.dtype`](https://github.com/h5py/h5py/blob/master/h5py/_hl/dataset.py#L177), which is the identifier for the actual HDF5 object. – ali_m Aug 13 '14 at 08:58

The docs of astype seem to imply that reading it all into a new location is its purpose. Thus your return d[:] is the most reasonable approach if you are going to reuse the float-cast data with many functions on separate occasions.

If you know what you need the casting for and only need it once, you could switch things around and do something like:

def get_dataset_as_float(intdata, *funcs):
    with HDF.File('data.h5', 'w') as hf:
        d = hf.create_dataset('data', data=intdata)
        with d.astype('float32'):
            d2 = d[...]  # the whole dataset is read and converted here
            return tuple(f(d2) for f in funcs)

In any case, you want to make sure that hf is closed before leaving the function or else you will run into problems later on.

In general, I would suggest separating the casting and the loading/creating of the data-set entirely and passing the dataset as one of the function's parameters.

The above can be called as follows:

In [16]: get_dataset_as_float(intdata, np.min, np.max, np.mean)
Out[16]: (9.0, 87.0, 42.299999)
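And a minimal sketch of the separation suggested above, keeping the file handling in the caller and passing the already-open dataset into the function (the function name here is just illustrative):

import h5py as HDF
import numpy as np

def compute_stats(dset, *funcs):
    # work on an already-open dataset passed in by the caller;
    # the cast to float32 happens on read, inside the astype context
    with dset.astype('float32'):
        data = dset[...]
    return tuple(f(data) for f in funcs)

intdata = (100 * np.random.random(10)).astype('uint16')
with HDF.File('data.h5', 'w') as hf:   # the caller owns (and closes) the file handle
    d = hf.create_dataset('data', data=intdata)
    print(compute_stats(d, np.min, np.max, np.mean))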
deinonychusaur