
I want to save a large amount of fixed-length string data in HDF5. The code is below.

However, it fails with this error message: TypeError: No conversion path for dtype: dtype('<U1')

Could you help me find the cause and fix it? Thanks.

import numpy as np
import h5py

sdata = [ np.array(["A", "C", "A", "T", "C", "C", "T", "C"]), np.array(["G", "A", "C", "C", "C", "T", "A", "A"]), np.array(["G", "G", "A", "C", "C", "A", "A", "G"]) ]
sdata = np.array(sdata)

h5File = "test.h5"

with h5py.File(h5File, 'w') as h5data:
    h5data.create_dataset('sequence', data=sdata, compression="lzf", chunks=True, maxshape=(None,sdata.shape[1]))
ybzhao
  • I fixed this by encoding sdata before calling h5data.create_dataset: `sdata = np.char.encode(sdata, "utf-8")` – ybzhao Nov 04 '21 at 16:07
  • The problem is an HDF5 limitation. HDF5 saves characters as byte strings, and doesn't support NumPy's fixed-width Unicode strings (which is the default string dtype for NumPy arrays). Encoding is one workaround. You can also explicitly define the dataset dtype as a byte string with `dtype='S##'`. There are multiple posts about this on SO. Here is one: [How can I retrieve HDF5 dataset that is storing strings](https://stackoverflow.com/a/69542792/10462884) – kcw78 Nov 05 '21 at 14:01
  • Thank you so much @kcw78, I will test that. – ybzhao Nov 05 '21 at 21:55
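
For reference, here is a minimal sketch of the two workarounds mentioned in the comments, applied to the data from the question. It assumes the data is plain ASCII, so `np.char.encode(..., "utf-8")` and `astype("S1")` give the same result. The temporary file path and the second dataset name `sequence_cast` are made up for illustration, and `astype("S1")` is just one way to get an explicit byte-string dtype.

```python
import os
import tempfile

import h5py
import numpy as np

# Same data as in the question: a 2-D array of single characters (dtype '<U1').
sdata = np.array([
    ["A", "C", "A", "T", "C", "C", "T", "C"],
    ["G", "A", "C", "C", "C", "T", "A", "A"],
    ["G", "G", "A", "C", "C", "A", "A", "G"],
])

# Workaround 1: encode the Unicode array to fixed-length byte strings.
encoded = np.char.encode(sdata, "utf-8")

# Workaround 2: cast to an explicit byte-string dtype (fine for ASCII data).
cast = sdata.astype("S1")

# Hypothetical output location, used only for this sketch.
h5_path = os.path.join(tempfile.mkdtemp(), "test.h5")

with h5py.File(h5_path, "w") as h5data:
    h5data.create_dataset(
        "sequence",
        data=encoded,
        compression="lzf",
        chunks=True,
        maxshape=(None, encoded.shape[1]),
    )
    h5data.create_dataset(
        "sequence_cast",  # hypothetical name for the second variant
        data=cast,
        compression="lzf",
        chunks=True,
        maxshape=(None, cast.shape[1]),
    )

# Reading back returns byte strings; decode them to get str values again.
with h5py.File(h5_path, "r") as h5data:
    raw = h5data["sequence"][:]
    restored = np.char.decode(raw, "utf-8")
```

Because the stored values are bytes, anything read back has to be decoded (as in the last step) before it is compared with ordinary Python strings.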
