In a Py3 session, where unicode is the default string type, I have to save ASCII text as bytestrings. Drawing on the ideas in http://docs.h5py.org/en/latest/strings.html:
In [1]: import h5py
In [2]: f=h5py.File('test.h5','w')
In [3]: grp=f.create_group('test')
In [5]: dset=grp.create_dataset('string',(3,), dtype=h5py.special_dtype(vlen=str))
In [6]: dset
Out[6]: <HDF5 dataset "string": shape (3,), type "|O">
In [7]: dset[0]='astring'
In [8]: dset[1]=b'astring'
In [9]: dset
Out[9]: <HDF5 dataset "string": shape (3,), type "|O">
In [10]: dset[:]
Out[10]: array(['astring', 'astring', ''], dtype=object)
So both Python string types were saved as unicode, because the dataset was created with vlen=str.
In [11]: dset.attrs['string']='unicode string'
In [12]: dset.attrs['bytes']=b'byte string'
In [13]: dset
Out[13]: <HDF5 dataset "string": shape (3,), type "|O">
In [14]: dset.attrs
Out[14]: <Attributes of HDF5 object at 2880654668>
In [15]: list(dset.attrs.items())
Out[15]: [('string', 'unicode string'), ('bytes', b'byte string')]
For attributes, the string type is preserved.
In [16]: dset2=grp.create_dataset('bstring', (3,), dtype=h5py.special_dtype(vlen=bytes))
In [17]: dset2[0]='astring'
In [19]: dset2[1]=b'astring'
In [22]: dset2[:]
Out[22]: array([b'astring', b'astring', b''], dtype=object)
This wrote bytestrings both times.
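A minimal sketch of the same idea as a script, in case it helps. The helper names (write_ascii_strings, read_ascii_strings), the dataset name and the temp-file path are my own for illustration. It also encodes str values to ASCII explicitly before writing, which is an extra step not in the session above; the point is that non-ASCII input then fails loudly instead of depending on how h5py converts it.

```python
import os
import tempfile

import h5py


def write_ascii_strings(path, values):
    """Write str/bytes values to a variable-length ASCII string dataset."""
    # Encode str explicitly; non-ASCII text raises UnicodeEncodeError here.
    encoded = [v.encode('ascii') if isinstance(v, str) else bytes(v)
               for v in values]
    with h5py.File(path, 'w') as f:
        dset = f.create_dataset('bstring', (len(encoded),),
                                dtype=h5py.special_dtype(vlen=bytes))
        # Assign element by element, as in the interactive session.
        for i, item in enumerate(encoded):
            dset[i] = item


def read_ascii_strings(path):
    """Read the dataset back as a list of Python bytes objects."""
    with h5py.File(path, 'r') as f:
        return [bytes(x) for x in f['bstring'][:]]


# Hypothetical file location, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), 'ascii_demo.h5')
write_ascii_strings(path, ['astring', b'astring'])
result = read_ascii_strings(path)
print(result)
```

Both inputs come back as bytes, matching Out[22] above.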
In [25]: f.close()
And the h5dump output:
In [26]: !h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   GROUP "test" {
      DATASET "bstring" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "astring", "astring", NULL
         }
      }
      DATASET "string" {
         DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "astring", "astring", NULL
         }
         ATTRIBUTE "bytes" {
            DATATYPE H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE SCALAR
            DATA {
            (0): "byte string"
            }
         }
         ATTRIBUTE "string" {
            DATATYPE H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE SCALAR
            DATA {
            (0): "unicode string"
            }
         }
      }
   }
}
}
So specifying bytes instead of str does the trick in Py3.
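If h5dump isn't handy, the character set can also be checked from Python through h5py's low-level type API. This is a rough sketch: the dataset_charset helper and the temp-file path are hypothetical names of mine, and it assumes the dataset has a string type (other types have no get_cset method).

```python
import os
import tempfile

import h5py


def dataset_charset(dset):
    """Return 'ASCII' or 'UTF8' for a string dataset, from its HDF5 type."""
    # dset.id is the low-level DatasetID; get_type() gives the HDF5 datatype.
    cset = dset.id.get_type().get_cset()
    names = {h5py.h5t.CSET_ASCII: 'ASCII', h5py.h5t.CSET_UTF8: 'UTF8'}
    return names.get(cset, 'unknown')


# Hypothetical file location, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), 'cset_demo.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('string', (3,), dtype=h5py.special_dtype(vlen=str))
    f.create_dataset('bstring', (3,), dtype=h5py.special_dtype(vlen=bytes))
    charsets = {name: dataset_charset(f[name])
                for name in ('string', 'bstring')}

print(charsets)
```

This should agree with the CSET lines in the dump: UTF8 for the vlen=str dataset and ASCII for the vlen=bytes one.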