How to create a variable-length ASCII-encoded string with h5py

Question

I want to write a variable-length string using Python with h5py. If I use

dset = grp.create_dataset('data_set_name', (1,), dtype=h5py.special_dtype(vlen=str))
dset[0] = 'some_string'

then h5dump tells me

   DATASET "data_set_name" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "some_string"
      }
   }

that UTF-8 character encoding has been used. However, I want plain ASCII encoding, in which case h5dump would report

   DATASET "data_set_name" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "some_string"
      }
   }

How can I achieve that using h5py? Or is this impossible?

hpaulj · Accepted Answer · 2017-04-19T16:19:52.047

In a Py3 session, where unicode is the default string type, I have to save ASCII as bytestrings. I'm drawing ideas from http://docs.h5py.org/en/latest/strings.html

In [1]: import h5py
In [2]: f=h5py.File('test.h5','w')
In [3]: grp=f.create_group('test')

In [5]: dset=grp.create_dataset('string',(3,), dtype=h5py.special_dtype(vlen=str))
In [6]: dset
Out[6]: <HDF5 dataset "string": shape (3,), type "|O">
In [7]: dset[0]='astring'
In [8]: dset[1]=b'astring'
In [9]: dset
Out[9]: <HDF5 dataset "string": shape (3,), type "|O">
In [10]: dset[:]
Out[10]: array(['astring', 'astring', ''], dtype=object)

So both Python string types got saved as unicode.

In [11]: dset.attrs['string']='unicode string'
In [12]: dset.attrs['bytes']=b'byte string'
In [13]: dset
Out[13]: <HDF5 dataset "string": shape (3,), type "|O">
In [14]: dset.attrs
Out[14]: <Attributes of HDF5 object at 2880654668>
In [15]: list(dset.attrs.items())
Out[15]: [('string', 'unicode string'), ('bytes', b'byte string')]

For attributes, string type is preserved.

In [16]: dset2=grp.create_dataset('bstring', (3,), dtype=h5py.special_dtype(vlen=bytes))
In [17]: dset2[0]='astring'
In [19]: dset2[1]=b'astring'
In [22]: dset2[:]
Out[22]: array([b'astring', b'astring', b''], dtype=object)

This wrote bytestrings both times.

In [25]: f.close()

And the dump

In [26]: !h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   GROUP "test" {
      DATASET "bstring" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "astring", "astring", NULL
         }
      }
      DATASET "string" {
         DATATYPE  H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "astring", "astring", NULL
         }
         ATTRIBUTE "bytes" {
            DATATYPE  H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "byte string"
            }
         }
         ATTRIBUTE "string" {
            DATATYPE  H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "unicode string"
            }
         }
      }
   }
}
}

So specifying bytes instead of str does the trick in Py3.
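Applied to the dataset from the question, that could look roughly like the sketch below. The function name `write_ascii_string` and the file name `ascii_test.h5` are made up for illustration; the character-set check uses h5py's low-level `get_cset()` and `h5py.h5t.CSET_ASCII` rather than h5dump. As far as I know, h5py 2.10 and later also accept `h5py.string_dtype(encoding='ascii')` as an equivalent spelling of the dtype.

```python
import h5py


def write_ascii_string(path, text=b"some_string", **file_kwargs):
    """Write one variable-length ASCII string; return (is_ascii, value read back)."""
    with h5py.File(path, "w", **file_kwargs) as f:
        grp = f.create_group("test")
        # vlen=bytes selects H5T_CSET_ASCII; vlen=str would select H5T_CSET_UTF8.
        dt = h5py.special_dtype(vlen=bytes)
        dset = grp.create_dataset("data_set_name", (1,), dtype=dt)
        dset[0] = text
        # Low-level check of the character set recorded in the dataset's datatype.
        is_ascii = dset.id.get_type().get_cset() == h5py.h5t.CSET_ASCII
        return is_ascii, dset[0]


if __name__ == "__main__":
    # Hypothetical output file; inspect it afterwards with: h5dump ascii_test.h5
    print(write_ascii_string("ascii_test.h5"))
```

Extra keyword arguments are passed straight to `h5py.File`, so the same function can be tried without touching disk by passing `driver="core", backing_store=False`.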
