
I have a script that logs data on a Windows machine (Win 7) using Python 2.7. I want to read these files on my RHEL machine using Python 3.5. I keep getting the following error (or similar):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 825929: ordinal not in range(128)

To make matters worse, the data are passed to the computer in hex/ASCII format (why the manufacturer did this I do not know), so the integer 27035 shows up in the text file as 0x699b. The data will look something like this:

0001100011000190001600011000110001300013000120001200013000140001a0002
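
As a quick sanity check of that encoding (the 0x prefix is just notation; the file itself stores only the hex digits):

assert hex(27035) == '0x699b'     # 27035 encodes to the hex digits 699b
assert int('699b', 16) == 27035   # and decodes back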

To write the data in Python 2.7 I simply do:

with open('dst.txt', 'w') as fid:
    fid.write(data_stream)

I had no problem reading these files when using 2.7 on my office computer, but after switching to 3.5 I do.

This used to work under 2.7:

with open('src.txt', 'r') as tmp:
    data = tmp.read().split('\n')

Using the same script under 3.5 caused errors (as above), so I defined the encoding:

with open('src.txt', 'r', encoding='latin-1') as tmp:
    data = tmp.read().split('\n')

This works most of the time (strange, because "open" under Python 2.7 supposedly defaults to ASCII... NOTE: specifying encoding='ascii' under 3.5 still results in errors), and I can at least read the file this way. The problem now is that not all of the lines contain the same number of characters (they should!). Infrequently a line will be missing one or two characters. I find the shorter lines via:

for r in data:
    if len(r) < 7721:
        print(r)
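
As a sketch (assuming the lines should contain only lowercase hex digits), the stray characters themselves can be flagged directly:

import re

for r in data:
    bad = re.findall(r'[^0-9a-f]', r)   # anything outside the hex alphabet
    if bad:
        print(bad, r)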

Within these lines I find strange characters like:

Ö\221Á
Ö\231\231Ù

where \221 and \231 show up as single characters (i.e. not four as you would expect).
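
(Those appear to be octal escapes for single bytes, which is why each renders as one character:)

assert 0o221 == 0x91   # \221 is the single byte 0x91
assert 0o231 == 0x99   # \231 is the single byte 0x99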

I guess my question is: what is going on here? I could throw away rows that do not have enough characters (this would be less than 1% of the data), but it just irks me that this does not work.

Is this caused by the data being converted to hex first, then written with ASCII encoding, and finally decoded via latin-1 (that is a lot going on)? If that is the case, then why can I not decode the data by specifying ASCII encoding?
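
For illustration: ASCII can only decode bytes below 0x80, while latin-1 maps every one of the 256 byte values to a character, which is why a stray high byte reads fine under latin-1 but raises under ASCII (the bytes here are taken from my edit below):

raw = b'f\xd6\x89\xc1f'
print(repr(raw.decode('latin-1')))   # works: 'fÖ\x89Áf'
raw.decode('ascii')                  # raises UnicodeDecodeError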

EDIT: I loaded the data in different ways:

open('src.txt', 'rU', encoding='latin-1')
open('src.txt', 'rb')
open('src.txt', 'rU', encoding='Windows-1252')

The data remain the same, but the mis-translated portions changed:

fÖ\211Áffe7700
f\xd6\x89\xc1ffe7700
fÖ‰Áffe7700

Whatever is between the "f" and "ffe7700" is what is not working.
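
A sketch for pinning down exactly which raw bytes are corrupt, skipping decoding entirely (this assumes the file should contain only hex digits and line endings):

# expected alphabet: hex digits plus newline/carriage return
# (the file was written on Windows)
EXPECTED = set(b'0123456789abcdef\r\n')

with open('src.txt', 'rb') as fid:
    raw = fid.read()

for pos, byte in enumerate(raw):   # iterating bytes yields ints in Python 3
    if byte not in EXPECTED:
        print(pos, hex(byte))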

tnknepp
  • Do you mean the file is a binary file? In that case you'd want to read it as bytes by passing `"rb"` instead of `"r"` when opening the file. In Python 2 I don't think it changes anything, but in Python 3 it will read the raw data as `bytes` objects instead of `str` ones; this is because in Python 3 all strings are Unicode, which is a real pain when dealing with raw byte data. – Tadhg McDonald-Jensen May 19 '17 at 19:07
  • No, it is a text file that has the data written in hex (does that make it binary?). Setting the read mode to 'rb' did not help; I still had shorter rows. – tnknepp May 19 '17 at 19:35
  • When you say _"the integer 27035 shows up in the text file as 0x699b"_, do you mean the hexadecimal representation of the file shows `0x699b`, or do you mean the string `"0x699b"` is text in the file? And the data following _"So the data will look something like this:"_ — is that what would show up in a text editor or a hex reader? How does that data relate to `27035`? – Tadhg McDonald-Jensen May 19 '17 at 19:52
  • The string is text in the file. The data I show is what shows up in a text file, and there is no relation to 27035. I chose 27035 as an arbitrary example, while the posted text is from the actual text file. In the data files each hex string is 5 characters long, so you convert to int by breaking up the long string into five-character segments and using int (e.g. int(line[:5], 16)). – tnknepp May 19 '17 at 20:05
  • OK, so the integer `27035` would actually show up as `0699b` as part of a long string of hexadecimal digits, meaning the file should contain only digits, `abcdef`, and the newline character? And you are telling me that it contains odd non-ASCII characters that work fine in 2.7 but not in 3.5? I find that just a bit hard to believe; are you sure the problem only happens when you use Python 3? – Tadhg McDonald-Jensen May 19 '17 at 20:18
  • I share your disbelief. I have never had a problem when using 2.7. Since the issue is so rare I suppose it could be just a bad bit getting transferred, but this would be unexpected as well. We have a couple other instruments like this, I will have to look at their data next week to see if this is happening across the board. – tnknepp May 19 '17 at 20:55

1 Answer


Perhaps the file is not latin-1.

I would use chardet to detect the file encoding.

$ chardetect src.txt
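
The same check can be run from within Python (a minimal sketch, assuming the chardet package is installed):

import chardet

with open('src.txt', 'rb') as fid:
    raw = fid.read()

print(chardet.detect(raw))   # e.g. {'encoding': ..., 'confidence': ...}
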
Andy Hayden
  • I doubted it was latin-1 (that encoding just happened to work somewhat). I ran chardetect and it returned a 72% chance of encoding being "Windows-1252". After specifying Windows-1252 I still have the same problem, but different! The Ö\221Á has now changed to Ö‰Á. – tnknepp May 19 '17 at 19:38
  • @tnknepp That's a low confidence! ... windows-1252 generally means chardet hasn't worked out an encoding (see http://chardet.readthedocs.io/en/latest/how-it-works.html); of course it could be that the file has mixed encodings or is corrupt :s – Andy Hayden May 19 '17 at 19:52
  • `chardet` is unreliable. Your html document may declare one encoding (e.g. gb2312) and then use another one (e.g. gb18030). `chardet` will naively trust the declaration, but when you try to open with `encoding="chardet told me this"`, you'll run into a `UnicodeDecodeError`. – imrek Aug 08 '20 at 07:38