3

I made a Pig Latin translator that takes input from the user, translates it, and returns it. I want to add the ability to read the text from a file instead, but I'm running into an issue: the file isn't being read the way I expect. Here is my code:

from sys import argv
script, filename = argv

file = open(filename, "r")

sentence = file.read()

print sentence

file.close()

The problem is that when I print out the information inside the file it looks like this:

■T h i s   i s   s o m e   t e x t   i n   a   f i l e

Instead of this:

This is some text in a file

I know I could work around the spaces and the odd square character with slicing, but that feels like treating a symptom. I want to understand why the text comes out formatted this way so I can fix the cause.

Will
Supetorus
  • Hey, so I can be a bit more accurate in my answer, could you edit your post and put the result of `hexdump` and `file` from the command line? Assuming you aren't on Windows. – Will Jan 04 '16 at 03:17
  • Or at least tell us the program you used to make that text file. – PM 2Ring Jan 04 '16 at 03:58
  • I used notepad++. As to doing the hexdump I will do that when I get home. – Supetorus Jan 04 '16 at 15:53
  • I tried doing the hexdump and it said the command hexdump is not recognized. Same happened with file. – Supetorus Jan 05 '16 at 16:18
  • @Supetorus: `hexdump` and `file` are standard commands on Unix-like systems, which is why Will said "Assuming you **aren't** on Windows". – PM 2Ring Jan 06 '16 at 06:09

4 Answers

4

I believe this is a UTF-16 encoded file, and the ■ character at the start is the Unicode byte order mark (BOM). It could also be another encoding with a byte order mark, but it definitely appears to be a multi-byte encoding.

This is also why you're seeing the whitespace between characters. UTF-16 effectively represents each character as two bytes, but for standard ASCII characters like you're using, the other half of the character is empty (second byte is 0).
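A quick way to see that at the byte level (just a sketch; the exact output assumes a little-endian machine, which is the common case):

# Python 2: encode a short ASCII string as UTF-16 and inspect the raw bytes
text = u'This'
print repr(text.encode('utf-16'))
# On a little-endian machine this prints '\xff\xfeT\x00h\x00i\x00s\x00':
# \xff\xfe is the byte order mark, and every ASCII character is paired with a zero byte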

Try this instead:

from sys import argv
import codecs
script, filename = argv

file = codecs.open(filename, encoding='utf-16')
sentence = file.read()
print sentence
file.close()

Replace encoding='utf-16' with whatever encoding this actually is. You might just need to try a few and experiment.
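For example, a rough way to "try a few" could look like this (the candidate list and the filename text.txt are just guesses, and a wrong encoding can still decode to garbage without raising an error, so treat it as a heuristic):

import io

def try_encodings(filename, candidates=('utf-16', 'utf-8-sig', 'latin-1')):
    # Try each candidate encoding in turn and return the first one that decodes cleanly
    for enc in candidates:
        try:
            with io.open(filename, encoding=enc) as f:
                return enc, f.read()
        except UnicodeError:
            continue
    raise ValueError('none of the candidate encodings worked')

enc, sentence = try_encodings('text.txt')
print 'decoded with', enc
print sentence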

Will
  • Typo: the other half of the **character** is zero. Your answer looks good, pity the OP is not responding. – PM 2Ring Jan 04 '16 at 03:57
  • Thanks! Corrected :) Yeah, I'm hoping OP can show the output of `hexdump` and `file` so we can figure out more clearly what the encoding is. I think SO converted it to UTF-8, and hexdumping it on my end doesn't show any known byte-order mark. – Will Jan 04 '16 at 03:59
  • Aha. I checked my notepad++ settings and it creates files in UTF-8. I used the codecs stuff to fix it and it worked. Also I tried using hexdump and file as I told another guy in a comment above and both said the command is not recognized. Perhaps I didn't use it as you meant. I typed it directly into powershell as `hexdump text.txt` text.txt being the name of my file. Same with file. – Supetorus Jan 05 '16 at 16:24
2

The original file is UTF-16. Here's an example that writes a UTF-16 file and reads it with open vs. io.open, which takes an encoding parameter:

#!python2
import io

sentence = u'This is some text in a file'

with io.open('file.txt','w',encoding='utf16') as f:
    f.write(sentence)

with open('file.txt') as f:
    print f.read()

with io.open('file.txt','r',encoding='utf16') as f:
    print f.read()

Output on US Windows 7 console:

 ■T h i s   i s   s o m e   t e x t   i n   a   f i l e
This is some text in a file

As a guess, I'd say the OP created the text file in Windows Notepad and saved it as "Unicode", which is Microsoft's misnomer for UTF-16 encoding.
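If you want to confirm which encoding it is, one option (just a sketch; file.txt is a placeholder name) is to peek at the first bytes of the file and compare them against the byte order marks that the codecs module exposes:

import codecs

with open('file.txt', 'rb') as f:
    head = f.read(4)

# BOM_UTF16_LE is '\xff\xfe', BOM_UTF16_BE is '\xfe\xff', BOM_UTF8 is '\xef\xbb\xbf'
if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
    print 'looks like UTF-16 (what Notepad calls "Unicode")'
elif head.startswith(codecs.BOM_UTF8):
    print 'looks like UTF-8 with a BOM'
else:
    print 'no BOM found'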

Mark Tolonen
  • Great answer! What confused me is that when I tried to `hexdump` the BOM from the post text, it didn't seem to be a UTF-16 BOM. But I'm guessing that's just because SO uses UTF-8 :) – Will Jan 05 '16 at 21:41
  • @Will, the post text was likely decoded in `cp437`, since that's what my terminal is, and the ■ character is FEh in that encoding. That's part of the UTF-16 BOM :) – Mark Tolonen Jan 06 '16 at 03:15
1

At first, when I saw everyone responding with stuff about Unicode and UTF, I shied away from reading up on it and trying to fix it, but I'm persistent about learning to program in Python, so I did some research, primarily this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

That was really helpful. So what I can gather is that Notepad++, which I used to write the text file, wrote it in UTF-8, and Python read it in UTF-16. The solution was to import codecs and use codecs.open like this (as Will said above):

from sys import argv
import codecs

script, filename = argv

file = codecs.open(filename, encoding="utf-8")

sentence = file.read()

print sentence

file.close()
Supetorus
  • Weird. UTF-8 shouldn't produce those spaces for chars in the ASCII range. But anyway... As well as that article by SO co-founder Joel, you may like to take a look at [unipain](http://nedbatchelder.com/text/unipain.html) by SO veteran Ned Batchelder, which is more Python-specific. – PM 2Ring Jan 06 '16 at 06:17
0

Well, the most likely explanation is that your program is reading the file's data correctly.

As for why the output looks weird, that could be due to many reasons.

However, it looks like you are using Python 2 (the print statement), and since the text is appearing as

CHAR CHAR

I would assume that the file you are reading is Unicode-encoded text, so that ABC is written as \u0041\u0042\u0043.

Either decode the byte string into a Unicode string, or use Python 3 and look into the Unicode issue.
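A minimal sketch of the first option (assuming the file really is UTF-16; the filename is made up):

# Python 2: read the raw bytes and decode them into a unicode string
raw = open('text.txt', 'rb').read()
sentence = raw.decode('utf-16')   # the decoder also strips the byte order mark
print sentence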

Tim Seed