
I have a bunch of HDF files that I need to read with pandas' `pd.read_hdf()`, but they were saved in a Python 2.7 environment. Nowadays I'm on Python 3.7, and when trying to read them with `data = pd.read_hdf('data.h5', 'data')`, I'm getting

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 6: invalid start byte

Now, I know those files can contain various weird things like Ä or ö, and 0xf6 probably is ö.

So how do I read this hdf file?

The documentation for `read_hdf` only specifies `mode` as a parameter, but changing it doesn't do anything. Apparently this is an old bug in pandas, or rather in the underlying PyTables, that can't be fixed. However, that report is from 2017, so I wonder if it has been fixed since, or whether there's a workaround that I just can't find. According to the bug report, you can also pass `encoding=''` to the reader, but that doesn't do anything either: neither `encoding='UTF8'` as suggested in the bug, nor `encoding='cp1250'`, which I would assume could be the culprit, makes any difference.

It's quite annoying that a file format meant to archive data apparently can't be read anymore by the program that produced it after just one version step. I would be perfectly fine with having the ös garbled to ␣ý⌧ or similar fun things, as usual with encoding errors, but simply not being able to read the file at all is an issue.
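If you can get at the raw bytes (for example by opening the file with h5py instead of pandas), a small fallback decoder gives you exactly that "garble rather than crash" behaviour. This is only a sketch of the decoding step, not of the HDF5 plumbing; the helper name `decode_lossy` and the encoding order are my own assumptions:

```python
def decode_lossy(raw, encodings=("utf-8", "cp1252")):
    """Try likely encodings in order; garble rather than crash as a last resort."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Replacement characters instead of an exception, as described above.
    return raw.decode("utf-8", errors="replace")

# 0xf6 is an invalid start byte in UTF-8, but decodes as "ö" in cp1252.
print(decode_lossy(b"K\xf6ln"))  # Köln
```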

JC_CL
  • Was the file generated on a Windows computer? Then you might try a cp1252 or latin1 encoding. Of course, you can also open the file in binary mode (`'rb'`); then you will get the raw bytes instead of (Unicode) strings. – EvertW May 19 '20 at 09:10
  • It was generated on Linux, but the workflow that produced it was using a lot of CSV files originating from some old Windows box. `cp1252` and `latin1` also produce the same error. `mode='rb'` was also my first idea, but `mode` only supports `r`, `r+` and `a`. – JC_CL May 19 '20 at 09:20
  • 2
    My comment assumed that you were reading a file directly. As HDF5 is a binary format, your file is always opened as such. The conversion to unicode is done by the HDF reader. I think you will need to modify the low-level reader to solve this one, and curse the old HDF5 writer for having a cavalier attitude towards Unicode. Your file probably reports the UTF8 encoding, but uses latin1 internally--an error. – EvertW May 19 '20 at 09:36
  • 1
    Your best bet is to open the dataset in another tool (e.g. python2 with hdf5 reader), and re-write the data ensuring that the encoding is correct. – EvertW May 19 '20 at 09:55
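The re-write suggested in the last comment boils down to transcoding the stored bytes once, then writing them back as genuine UTF-8. The round trip itself is plain Python; the h5py/PyTables plumbing around it is omitted here, and latin-1 as the true source encoding is an assumption:

```python
# The offending byte from the traceback: 0xf6 is "ö" in latin-1/cp1252,
# but an invalid start byte in UTF-8 -- exactly the reported error.
raw = b"K\xf6ln"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf6 ...

# Decode with the (assumed) real source encoding, re-encode as UTF-8,
# then write the result to a fresh HDF5 file with your tool of choice.
fixed = raw.decode("latin-1").encode("utf-8")
print(fixed)  # b'K\xc3\xb6ln' -- now valid UTF-8
```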

0 Answers