How to determine the position from decoding error message in Python?

Question

I keep receiving this UnicodeDecodeError:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 16592600: character maps to <undefined>

when trying to run the following on a Unicode .txt file:

f=open('FY16_Query_Analysis1.txt','rU')
raw=f.read()

I think that is the character at position 16592600 in the text file being read. Most text editors, code IDEs, etc should have a character or cursor position indicator to find that position within the text file, similar to this -> http://stackoverflow.com/questions/17153333/text-editor-which-tells-the-index-of-the-cursor-position — chickity china chinese chicken, Feb 16 '17 at 01:53
You need to try decoding with UTF-8 (see my answer). If it doesn't work, you need to provide more information in order for us to help you guess the right encoding. — lenz, Feb 16 '17 at 08:06
If you know what the problematic text is supposed to represent, but not which encoding it's in, you might be able to glean the correct encoding from a lookup table like https://cdn.rawgit.com/tripleee/8bit/master/encodings.html#8d — tripleee, Feb 16 '17 at 08:09
@lenz, thank you. I had to add: encoding='utf_8' for the file to run. Appreciate the help. — pdel5, Feb 16 '17 at 15:44

score 0 · Accepted Answer · answered Feb 16 '17 at 08:01

In short, you have to (1) find out what encoding is used in the text file, and then (2) specify it.

Step 2 is easy. For example, if the encoding is UTF-8:

f = open('FY16_Query_Analysis1.txt', 'r', encoding='utf8')

(As a side note: the use of the "U" mode character is deprecated; you should specify universal-newline mode with newline=None or simply omit it, since this is the default.)

If you don't specify encoding=, then the preferred encoding of your locale is used. To see what it is set to from within Python, try this (e.g. in an interactive session):

import locale
locale.getpreferredencoding()

This tells you what is used now, which is apparently wrong.

Step 1, finding out what the correct encoding is, can be tricky. If the source of your file doesn't tell you, then you'll have to guess. A good guess to start with is always UTF-8, since (a) it is widespread and (b), more importantly, it is "picky": If UTF-8 is the wrong choice, then it is extremely likely that you will notice by receiving a UnicodeError.
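
A minimal sketch of that first guess, run on in-memory bytes rather than the real file. The helper name `decodes_as_utf8` and the sample text are made up for illustration.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative sample text, stored once as UTF-8 and once in an 8-bit encoding.
good = "naïve café".encode("utf-8")
bad = "naïve café".encode("cp1252")

print(decodes_as_utf8(good))
print(decodes_as_utf8(bad))
```

For a real file, the same check can be done by reading it in binary mode and passing the bytes to the helper.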

Try UTF-8 first. If it fails, things get trickier. Chances are that you are dealing with an 8-bit encoding, in which case you cannot rely on an exception-free pass: decoding with Latin-1, for example, will always work (you can even "decode" a JPEG image with Latin-1), but if it's the wrong choice the result is a string of gibberish. You'll have to do some trial and error with different 8-bit encodings, and look at the problematic position to see whether the result is something reasonable.
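
The trial and error could look roughly like the sketch below. The byte string is hypothetical (only the byte 0x8d comes from the question), the list of candidate codecs is an arbitrary selection, and `show_candidates` is a made-up helper name.

```python
def show_candidates(fragment: bytes, encodings):
    """Decode the same bytes with each candidate codec and collect the results."""
    results = {}
    for name in encodings:
        try:
            results[name] = fragment.decode(name)
        except UnicodeDecodeError as exc:
            # Record where and why this codec gave up instead of raising.
            results[name] = f"<fails at byte {exc.start}: {exc.reason}>"
    return results


# Hypothetical bytes surrounding the problematic position.
fragment = b"fran\x8dais"
candidates = ["cp1252", "latin-1", "cp437", "mac_roman"]

for name, text in show_candidates(fragment, candidates).items():
    print(f"{name:10} {text!r}")
```

Only a human reader can judge which of the decoded strings looks like sensible text; a codec that raises no exception is not necessarily the right one.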

score 0 · Answer 2 · answered Feb 16 '17 at 08:33

This error usually arises when you try to read a file using the wrong encoding. But given the very high offset in your case, it's also possible that you have the correct encoding but some sort of glitch in the file; there is no point in speculating on the specifics.

Since the file seems to be mostly correct, you can just ask for undecodable bytes to be replaced with the special character '�' (Unicode "\ufffd"). You can then find the context of the error by simply searching for this character.

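# Undecodable bytes become U+FFFD instead of raising an exception
with open('FY16_Query_Analysis1.txt', errors="replace") as f:
    raw = f.read()

# Print every line that contains the replacement character
for line in raw.splitlines():
    if '\ufffd' in line:
        print(line)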

Depending on what you see, you can decide what to do next.

Alternatively, you could read the file in binary mode and inspect the bytes around the offset in question; for example:

# Binary mode: no decoding happens, so reading cannot fail
with open('FY16_Query_Analysis1.txt', 'rb') as f:
    raw = f.read()

errorpos = 16592600  # offset reported in the error message
fragment = raw[errorpos-20:errorpos+40]
print(repr(fragment))
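
The offset does not have to be copied by hand from the message: a UnicodeDecodeError carries it in its `start` attribute, which is an index into the bytes handed to the decoder (available as `object`). A small sketch on made-up in-memory data, assuming cp1252 as the wrong codec:

```python
# Made-up data; only the stray byte 0x8d is taken from the question.
data = b"plain ascii, then a stray byte \x8d and more text"

try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    pos = exc.start  # index of the first undecodable byte
    # Slice some context out of the original bytes around that index.
    context = exc.object[max(0, pos - 20):pos + 20]
    print(exc.encoding, pos, exc.reason)
    print(repr(context))
```

When the whole file is decoded in one go, as with a single f.read(), this index is a byte offset, which is why the binary-mode slice above uses it directly.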
