Python 2.7 reads encoded text file as code rather than text. (Fixed with io module)

Question

I have a text file (*.txt) that displays as plain text when opened in Notepad. When I read the file into Python:

with open(Working_File, 'r') as WorkTXT:
    WorkTXT_Lines = WorkTXT.readlines()
# The with block closes the file on exit, so no explicit close() is needed.

My script then fails because the text is being converted into something else. I can manually test what's in the list using the console:

In[51]: WorkTXT_Lines[4]
Out[51]: "\x00T\x00h\x00e\x00 \x00A\x00c\x00q\x00.\x00 \x00M\x00e\x00t\x00h\x00o\x00d\x00'\x00s\x00 \x00I\x00n\x00s\x00t\x00r\x00u\x00m\x00e\x00n\x00t\x00 \x00P\x00a\x00r\x00a\x00m\x00e\x00t\x00e\x00r\x00s\x00 \x00f\x00o\x00r\x00 \x00t\x00h\x00e\x00 \x00R\x00u\x00n\x00 \x00w\x00e\x00r\x00e\x00 \x00:\x00 \x00\r\x00\n"

If I open the original text file, copy-paste the text into a new text file, and run the script on that, it picks up the actual text and works correctly. That does not help, though, as I am parsing hundreds of text files generated by a lab instrument.

Any help is appreciated, even something like an OS command to alter the text file.

Edit: I was able to solve the issue after being pointed in the right direction. The io module is able to decode the text file when it is opened to "read as text" ('rt'):

import io
with io.open(Working_File, 'rt') as WorkTXT:
    WorkTXT_Lines = WorkTXT.readlines()

Are you using Python 2, or Python 3? Either way, the root problem is that you're trying to read something that's almost certainly UTF-16-BE as if it were ASCII or Windows-1252 or Latin-1 or similar, but the right way to fix it will be different. — abarnert, Jun 27 '18 at 22:07
@MoxieBall I'm not sure it's a dup, because that question is 2.x-specific. (It's also about UTF-16-LE with a BOM, rather than UTF-16-BE without, but that's not a big difference.) — abarnert, Jun 27 '18 at 22:09
Actually, I just noticed that you only printed line 4, not line 0. So this actually _might_ be UTF-16-LE with a BOM. If you call `readlines()` on that, it'll split on the half-character that ends in `\n`, and start the next line with an extra `\0`. So (assuming the text is mostly ASCII) all the lines after the first can end up looking like UTF-16-BE even though they're -LE. — abarnert, Jun 27 '18 at 22:11
.decode("utf-16") you must decode the whole file or the string — ThunderHorn, Jun 27 '18 at 22:15
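
The point about `readlines()` splitting a UTF-16-LE file on a half-character can be reproduced with a short sketch. The sample text and the temporary file are made up for illustration, and the binary-mode read only stands in for what Python 2's `open(..., 'r')` hands back; it is not the original instrument data.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample: two ASCII lines stored as UTF-16-LE with a BOM.
sample = codecs.BOM_UTF16_LE + u"first\r\nsecond\r\n".encode("utf-16-le")

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(sample)

    # Byte-oriented read: each line is cut right after a b"\n" byte, which is
    # only the first half of the two-byte UTF-16-LE newline.
    with io.open(path, "rb") as f:
        byte_lines = f.readlines()

    # Decoding read: the bytes become text before any line splitting happens.
    with io.open(path, "r", encoding="utf-16") as f:
        text_lines = f.readlines()
finally:
    os.remove(path)

print(repr(byte_lines[1]))  # starts with the leftover \x00 of the previous newline
print(text_lines)
```

Every line after the first begins with the `\x00` that belonged to the previous newline, which is why the later lines look big-endian even though the file is little-endian.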

ThunderHorn · Answer 1 · 2018-06-28T16:14:17.443

1

The file content is encoded. I googled your output and it appears to be UTF-16. If you decode the file as you read it, everything comes out as plain text:

import io

# io.open decodes the bytes as UTF-16-LE while reading, so readlines()
# returns plain (unicode) text lines.
with io.open(Working_File, 'r', encoding='utf-16-le') as WorkTXT:
    WorkTXT_Lines = WorkTXT.readlines()
    for line in WorkTXT_Lines:
        print(line)
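
If the files start with a byte order mark, the 'utf-16-le' codec keeps it as u'\ufeff' at the start of the first line, whereas the plain 'utf-16' codec consumes it and uses it to choose the byte order. A minimal sketch of picking the codec per file follows; `guess_utf16_codec` and `read_text_lines` are hypothetical helper names, and falling back to 'utf-16-le' when no BOM is found is an assumption about the instrument's output, not something confirmed in the question.

```python
import codecs
import io
import os
import tempfile


def guess_utf16_codec(path, default="utf-16-le"):
    """Pick a codec from the first two bytes of the file (hypothetical helper)."""
    with io.open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # "utf-16" consumes the BOM and uses it to choose the byte order.
        return "utf-16"
    # No BOM found: fall back to an assumed byte order.
    return default


def read_text_lines(path):
    """Read a UTF-16 file into a list of unicode lines."""
    with io.open(path, "r", encoding=guess_utf16_codec(path)) as f:
        return f.readlines()


# Demo on a made-up file: UTF-16-LE with a BOM.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(demo_path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE + u"Run 1\r\nRun 2\r\n".encode("utf-16-le"))
    demo_codec = guess_utf16_codec(demo_path)
    demo_lines = read_text_lines(demo_path)
finally:
    os.remove(demo_path)

print(demo_lines)
```

For a batch of instrument files, `read_text_lines` could be called once per path in place of the `io.open` block above.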

edited Jun 28 '18 at 16:14

answered Jun 27 '18 at 22:17

ThunderHorn

please add some explanation – Harsha Biyani Jun 28 '18 at 07:11
Thanks for this suggestion. Currently, using this method I get a list of length 1. [u'\u6144\u6174\u4620\u6c69\u2065\u3a20\u5920\u5c3a\u5550\u4952\u5946\u545c\u5341\u534b\u305c\u3236\u3130\u5f38\u6956\u5669\u505f\u4552\u5f50\u4146\u505c\u4552\u5f50\u4553\u2051\u3032\u3831\u302d\u2d36\u3032\u3020\u2d37\u3335\u342d\u5c39\u364e\u3338\u3836\ (etc...) So it looks like it gets decoded, but not into anything particularly useful. I also tried switching to "utf-8", but that fails to run. – user1933192 Jun 28 '18 at 15:17
I get the error "TypeError: 'encoding' is an invalid keyword argument for this function", but it might be related to me using Python 2.7 as part of Python(x,y). Now that you've pointed me in this direction, I'm reading some documentation on the io module. – user1933192 Jun 28 '18 at 15:52
Yeah, sorry, I was on the train typing from my phone and couldn't run the code. – ThunderHorn Jun 28 '18 at 16:15
I think your newest edit works. I tried this and it seems to work correctly: `with io.open(Working_File,'rt') as WorkTXT:`. 'rt' seems to indicate to Python that the data should come out as text, and it figures out the encoding type by itself. You were very helpful, thanks. – user1933192 Jun 28 '18 at 16:21
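
On the idea that 'rt' figures out the encoding by itself: the io documentation describes the default encoding as platform dependent (whatever `locale.getpreferredencoding()` returns), so relying on it may behave differently from machine to machine. The sketch below, using a made-up sample file, prints that default and then passes the encoding explicitly instead.

```python
import io
import locale
import os
import tempfile

# The encoding io.open normally falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Hypothetical sample file written as UTF-16 with a BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # newline="" writes the "\r\n" exactly as given, without translation.
    with io.open(path, "w", encoding="utf-16", newline="") as f:
        f.write(u"The Acq. Method\r\n")

    # Naming the encoding removes the dependence on the locale default.
    with io.open(path, "rt", encoding="utf-16") as f:
        explicit_lines = f.readlines()
finally:
    os.remove(path)

print(explicit_lines)
```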
