I'm using the command below to create an HDF5 dataset that stores an array of strings with dtype 'S10'.

create_dataset(dset_name, (0,), dtype='S10', maxshape=None, chunks=True)

It stores the data correctly in the HDF5 file's group, and I can see the correct data in the HDF5 viewer. However, when I use group.keys(), I can't see the dataset. The dataset's icon also looks different in the viewer, as shown in the image linked below.

Also, when I print the dataset in the terminal, the output is [b'str', b'str2', b'str3', ...], with the strings in b'' format.

How can I retrieve such a dataset?

Check this link to see the difference in the icon of the dataset

NickS1
  • Since this seems to be primarily an issue of group and dataset names, your file creation description is incomplete, and I don't think we can help. In Python, `S10` bytestrings are displayed with the `b` tag. From the `png`, I'd expect `arr = f['g/g_var'][:]` to work. – hpaulj Oct 08 '21 at 16:05

1 Answer

h5py returns fixed-length string datasets such as 'S10' as byte strings, not as Unicode strings. As a result, you have to convert between the two when moving data between HDF5 and Python. You can use .astype() on arrays, or .encode()/.decode() on individual elements.
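
For a handful of values, the element-wise route can be sketched in plain Python. The list below is hypothetical sample data that mimics the output shown in the question, and UTF-8 is an assumed encoding.

```python
# Hypothetical values, mimicking what the question printed from the dataset.
raw = [b'str', b'str2', b'str3']

# bytes -> str, one element at a time (UTF-8 is an assumption)
decoded = [item.decode('utf-8') for item in raw]
print(decoded)

# str -> bytes, e.g. before writing values back to an 'S10' dataset
encoded = [item.encode('utf-8') for item in decoded]
print(encoded)
```

The b'' prefix is only how Python displays bytes objects; the stored characters are the same.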

Here is a simple example to demonstrate the behavior. First it creates a file to mimic yours, then it extracts the data: once as default byte strings ('S10'), then using .astype('U') to convert the array to Unicode.

import h5py
import numpy as np

## Create a simple example file
with h5py.File('SO_69498550.h5', 'w') as h5w:
    grp = h5w.create_group('flower')
    iarr = np.arange(10)
    grp.create_dataset('g', data=iarr, maxshape=None, chunks=True)
    # Fixed-length byte strings, same dtype as in the question
    sarr = np.array(['str0', 'str1', 'str2', 'str3', 'str4',
                     'str5', 'str6', 'str7', 'str8', 'str9'], dtype='S10')
    grp.create_dataset('g_var', data=sarr, maxshape=None, chunks=True)

## Open file and read data from string dataset: 'flower/g_var'
with h5py.File('SO_69498550.h5', 'r') as h5r:
    # Default read: NumPy array of byte strings (dtype 'S10')
    bytes_arr = h5r['flower/g_var'][:]
    print(f'bytes_arr dtype: {bytes_arr.dtype}')
    print(bytes_arr)
    # Convert the whole array to Unicode strings
    str_arr = h5r['flower/g_var'][:].astype('U')
    print(f'str_arr dtype: {str_arr.dtype}')
    print(str_arr)
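
If the stored bytes are not plain ASCII, .astype('U') may raise a decoding error, since as far as I know it assumes ASCII. A minimal sketch of an alternative, assuming the bytes are UTF-8 encoded, is to decode with an explicit codec using np.char.decode. The sample array here is hypothetical and stands in for data read from an 'S10' dataset.

```python
import numpy as np

# Hypothetical sample data: UTF-8 encoded bytes, as they might come back
# from an 'S10' dataset (the encoding is an assumption).
raw = np.array(['café'.encode('utf-8'), b'str1'], dtype='S10')

# np.char.decode applies bytes.decode element-wise with the given codec.
decoded = np.char.decode(raw, 'utf-8')
print(decoded.dtype.kind)
print(decoded.tolist())

# Going the other way, e.g. before writing back to an 'S10' dataset.
encoded = np.char.encode(decoded, 'utf-8').astype('S10')
print(encoded)
```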
kcw78