I have a bunch of HDF files that I need to read with pandas' pd.read_hdf(), but they were saved in a Python 2.7 environment. I'm now on Python 3.7, and when I try to read them with data = pd.read_hdf('data.h5', 'data'), I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 6: invalid start byte
Now, I know these files can contain non-ASCII characters like Ä or ö, and 0xf6 is probably ö, which is what that byte means in Latin-1 and cp1250.
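The guess about 0xf6 can be checked without pandas. The snippet below is only a minimal sketch using Python's built-in codecs. The list of candidate encodings is an assumption about what a Python 2.7 setup might have written, not something the files themselves confirm.

```python
# The single byte that the traceback complains about.
raw = b"\xf6"

# Candidate single-byte encodings (an assumption, not read from the file).
candidates = ("latin-1", "cp1252", "cp1250")

# Decode the byte with each candidate and print what it turns into.
decoded = {codec: raw.decode(codec) for codec in candidates}
for codec, char in decoded.items():
    print(codec, "->", char)

# UTF-8 stores "ö" as two bytes, so a lone 0xf6 cannot be valid UTF-8.
utf8_bytes = "\u00f6".encode("utf-8")
print(utf8_bytes)

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)
```

If all three candidates print ö while the UTF-8 attempt fails, the strings were most likely written in a legacy single-byte encoding and are now being decoded as UTF-8.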
So how do I read this hdf file?
The documentation for read_hdf only specifies mode as a parameter, but changing that doesn't help. Apparently, this is an old bug in pandas, or rather in the underlying PyTables, that can't be fixed. However, that report is from 2017, so I wonder whether it has been fixed since, or whether there's a workaround that I just can't find. According to the bug report, you can also pass encoding='' to the reader, but that doesn't do anything when I specify encoding='UTF8', as suggested in the bug report, or encoding='cp1250', which I assume could be the encoding the files were actually written in.
It's quite annoying that a file format meant for archiving data apparently can't be read anymore by the program that produced it after just one version step. I would be perfectly fine with the ös being garbled to ␣ý⌧ or similar fun things, as usual with encoding errors, but not being able to read the file at all is a problem.
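To illustrate the "garbled but readable" behaviour wished for here, this is a minimal sketch on a plain byte string rather than on an HDF file. The sample bytes and the helper name decode_lossy are hypothetical, chosen only for illustration. Recent pandas documentation also lists an errors argument for read_hdf that describes the same kind of error handling, but whether it rescues these particular files is not verified here.

```python
def decode_lossy(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing undecodable bytes with U+FFFD instead of raising.

    Hypothetical helper for illustration; it is not part of pandas or PyTables.
    """
    return raw.decode(encoding, errors="replace")


# Hypothetical sample: "schön" as Python 2 might have stored it in Latin-1.
sample = b"sch\xf6n"

# Strict UTF-8 decoding would raise UnicodeDecodeError on the 0xf6 byte.
# With errors="replace" the text stays readable apart from the umlaut.
garbled = decode_lossy(sample)
print(garbled)

# Decoding with the encoding the data was presumably written in recovers it.
recovered = sample.decode("latin-1")
print(recovered)
```

The first result keeps every ASCII character and only loses the ö, which is the degraded outcome described above. The second shows that the data itself is intact once the right encoding is used.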