I received this .h5 file from a friend and I need to use the data in it for some work. All the data is numerical. This is the first time I have worked with this kind of file. I found many questions and answers here about reading these files, but I couldn't find a way to get to the lower levels of the groups (folders) the file contains.

The file contains two main folders, X and Y. X contains a folder named 0, which contains two folders named A and B. Y contains ten folders named 1-10. The data I want to read is in A, B, 1, 2, ..., 10. For instance, I start with

import h5py

f = h5py.File(filename, 'r')
f.keys()

Now f.keys() returns [u'X', u'Y'], the two main folders.

Then I try to read X and Y using read_direct but I get the error

AttributeError: 'Group' object has no attribute 'read_direct'

I try to create an object for X and Y as follows

obj1 = f['X']

obj2 = f['Y']

Then if I use commands like

obj1.shape
obj1.dtype 

I get an error

AttributeError: 'Group' object has no attribute 'shape'

I can see that these commands don't work because I use them on X and Y, which are folders that contain no data, only other folders.

So my question is: how do I get down to the folders named A, B, and 1-10 to read the data?

I couldn't find a way to do that even in the documentation http://docs.h5py.org/en/latest/quick.html

jpp
Mazin
  • Groups are like Python dictionaries. You have to keep indexing down through the groups until you reach a `dataset`. That has a `.shape` and can be loaded as a `numpy` array: `x = f['x']['foo']['bar'][...]` – hpaulj Jul 26 '18 at 23:12

1 Answer

You need to traverse down your HDF5 hierarchy until you reach a dataset. Groups do not have a shape or type; datasets do.

Assuming you do not know your hierarchy structure in advance, you can use a recursive algorithm to yield, via an iterator, full paths to all available datasets in the form group1/group2/.../dataset. Below is an example.

import h5py

def traverse_datasets(hdf_file):

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

You can, for example, iterate all dataset paths and output attributes which interest you:

with h5py.File(filename, 'r') as f:
    for dset in traverse_datasets(f):
        print('Path:', dset)
        print('Shape:', f[dset].shape)
        print('Data type:', f[dset].dtype)

Remember that, by default, HDF5 datasets are not read entirely into memory. You can read one into memory via arr = f[dset][:], where dset is the full path.
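If you only need everything directly inside one group, such as Y, a loop over the group's items may be enough, and it does not depend on how many datasets there are. Below is a minimal sketch. It assumes A, B and 1-10 are datasets rather than groups, and it builds a small in-memory file with made-up contents so it can run on its own. The helper name read_group_datasets and the file name demo.h5 are hypothetical.

```python
import h5py
import numpy as np

def read_group_datasets(group):
    """Return {name: numpy array} for every dataset directly inside `group`."""
    data = {}
    for name, item in group.items():
        if isinstance(item, h5py.Dataset):  # skip any nested groups
            data[name] = item[()]           # [()] reads the whole dataset into memory
    return data

# Build an in-memory file that mimics the layout described in the question.
# The values are invented for illustration; nothing is written to disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('X/0/A', data=np.arange(3))
    f.create_dataset('X/0/B', data=np.arange(4))
    for i in range(1, 11):
        f.create_dataset('Y/{0}'.format(i), data=np.full(2, i))

    y_data = read_group_datasets(f['Y'])   # works for any number of datasets in Y
    a = f['X']['0']['A'][()]               # index down one level at a time

print(sorted(y_data, key=int))
```

The dictionary keys are the dataset names as strings ('1', '2', ...), so y_data['3'] would be the array stored under Y/3.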

rgov
jpp
  • Thank you so much. I tried the comment before your answer and that worked: `obj1 = f['X']['A']`. I was wondering how to read the second folder, which has 10 subfolders, because that number is going to change in the future. So I find your answer very helpful. I still get an invalid syntax error at the line `path = f'{prefix}/{key}'`. Now I want to save the data in subfolders 1-10 using a loop instead of 10 different commands, especially since, as I said, the number could change in the future. – Mazin Jul 27 '18 at 00:05
  • You get a syntax error because f-strings only work in Python 3.6+. You can use `path = '{0}/{1}'.format(prefix, key)` instead. – jpp Jul 27 '18 at 01:09
  • I got an error at `from` in `yield from h5py_dataset_iterator(item, path)`. When I removed it, the code ran with no error but didn't print the data types or any of the other attributes in the `print` commands. More importantly, how do I read the ten datasets in the group Y? Ideally with a loop, because the number of datasets in that group can vary and is not always 10. Sorry, I feel like I'm bugging you with this question. – Mazin Jul 27 '18 at 18:21
  • What version of Python are you using? `yield from` was introduced in v3.3. – jpp Jul 27 '18 at 18:54
  • Mine is 2.7.12. It seems too old compared to 3.3. – Mazin Jul 27 '18 at 20:06
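
For anyone else stuck on an interpreter older than 3.3, here is a sketch of the same traversal with the two newer constructs swapped out: `str.format` in place of the f-string, and an explicit loop in place of `yield from`. Simply deleting `from` is not equivalent, because the function then yields the nested generator object itself rather than the items it produces. The in-memory demo file and its contents are made up for illustration, and I have only reasoned about this against the h5py API, not tried it on 2.7.12 specifically.

```python
import h5py
import numpy as np

def traverse_datasets(hdf_file):
    """Yield the full path of every dataset, without f-strings or `yield from`."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{0}/{1}'.format(prefix, key)  # replaces the f-string
            if isinstance(item, h5py.Dataset):    # dataset: report it
                yield (path, item)
            elif isinstance(item, h5py.Group):    # group: go down one level
                # replaces `yield from`: re-yield each result of the recursive call
                for result in h5py_dataset_iterator(item, path):
                    yield result

    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

# Hypothetical demo file kept in memory; nothing is written to disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('X/0/A', data=np.zeros(2))
    f.create_dataset('Y/1', data=np.ones(3))
    paths = list(traverse_datasets(f))
    shapes = [f[p].shape for p in paths]

print(paths)
print(shapes)
```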