In short, you have to (1) find out what encoding the text file uses, and then (2) specify it.
Step 2 is easy. For example, if the encoding is UTF-8:
f = open('FY16_Query_Analysis1.txt', 'r', encoding='utf8')
(As a side note: the "U" mode character is deprecated. To get universal-newline mode, specify newline=None, or simply omit the argument, since this is the default.)
If you don't specify encoding=, then your locale's preferred encoding is used.
To see what it is set to from within Python, try this (e.g. in an interactive session):
import locale
locale.getpreferredencoding()
This tells you what is used now, which is apparently wrong.
Step 1, finding out what the correct encoding is, can be tricky.
If the source of your file doesn't tell you, then you'll have to guess.
A good first guess is always UTF-8, since (a) it is widespread and (b), more importantly, it is "picky": if UTF-8 is the wrong choice, you will almost certainly notice, because decoding raises a UnicodeDecodeError (a subclass of UnicodeError).
So try UTF-8 first and see if it works.
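A minimal sketch of that first attempt, assuming you read the file in binary mode and decode the bytes yourself. The helper name try_utf8 and the sample bytes are made up for illustration; with a real file you would get the bytes from open(path, 'rb').read().

```python
def try_utf8(raw: bytes):
    """Return (text, None) if raw is valid UTF-8, else (None, the decode error)."""
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as err:
        return None, err


# Illustrative sample data: the same word encoded as UTF-8 and as Latin-1.
good = 'café'.encode('utf-8')
bad = 'café'.encode('latin-1')

text, err = try_utf8(good)
print(text)

text, err = try_utf8(bad)
# err.start is the byte offset where decoding failed; err.reason says why.
print(err.start, err.reason)
```

The error's start attribute gives you the "problematic position" to look at if you have to move on to other encodings.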
However, if it doesn't, then it gets tricky.
Chances are that you are dealing with an 8-bit encoding, in which case you cannot rely on an exception-free pass – for example, decoding with Latin-1 will always work (you can even "decode" a JPEG image with Latin-1), but if it's the wrong choice the result is a string of gibberish.
You'll have to do some trial and error with different 8-bit encodings, and look at the problematic position to see if the result is something reasonable.
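The trial and error can be sketched roughly like this: decode a small window of bytes around the problematic position with each candidate and compare the results by eye. The function name preview_decodings, the candidate list, and the sample bytes are all assumptions for illustration, and slicing bytes like this is only safe for single-byte (8-bit) encodings.

```python
def preview_decodings(raw: bytes, position: int, candidates, width: int = 20):
    """Decode the bytes around `position` with each candidate encoding."""
    start = max(0, position - width)
    window = raw[start:position + width]
    results = {}
    for name in candidates:
        try:
            results[name] = window.decode(name)
        except UnicodeDecodeError as err:
            # Some 8-bit codecs (e.g. cp1252) leave a few byte values undefined.
            results[name] = f'<failed: {err.reason}>'
    return results


# Illustrative sample data: text that was actually written as cp1252.
raw = 'price: 10 € (approx.)'.encode('cp1252')
position = raw.index(b'\x80')  # the byte that is not valid UTF-8 here

candidates = ['latin-1', 'cp1252', 'cp437']
for name, text in preview_decodings(raw, position, candidates).items():
    print(f'{name:8} {text!r}')
```

None of the candidates is rejected here, which is exactly the problem with 8-bit encodings: only the cp1252 line shows a euro sign where one makes sense, and you have to spot that yourself.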