2

I have to write a script that supports reading a file which can be saved as either Unicode or ANSI (using MS Notepad).

I don't have any indication of the encoding format in the file; how can I support both encoding formats? (i.e., a generic way of reading files without knowing the format in advance).

YSY
  • Which Python version are you using? 2.x and 3.x handle Unicode differently. – Brigand Dec 11 '11 at 18:46
  • For Unicode, you can use a Byte Order Mark (BOM) on UTF-16 files to show that it is in fact Unicode, and which order the bytes are in. A regular "ANSI" file (ASCII, I presume?) is highly unlikely to start with such a marker. – Marc B Dec 11 '11 at 18:49
  • Beware that "ANSI" is not a character encoding: John Machin's answer is the only one so far with an accurate definition of what "ANSI" is. – Thanatos Dec 11 '11 at 23:20

2 Answers

16

MS Notepad gives the user a choice of 4 encodings, expressed in clumsy, confusing terminology:

"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16 to decode such a file.

"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig to decode such a file.

"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".

Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:

cp874  Thai
cp932  Japanese 
cp936  Unified Chinese (P.R. China, Singapore)
cp949  Korean 
cp950  Traditional Chinese (Taiwan, Hong Kong, Macao(?))
cp1250 Central and Eastern Europe 
cp1251 Cyrillic (Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian)
cp1252 Western European languages
cp1253 Greek 
cp1254 Turkish 
cp1255 Hebrew 
cp1256 Arabic script
cp1257 Baltic languages 
cp1258 Vietnamese
cp???? languages/scripts of India  
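
To see why the "ANSI" label is dangerous, watch one and the same byte (0xE0, picked arbitrarily) decode to four different characters under four of those code pages (Python 2 interactive session):

>>> ansi_byte = '\xe0'
>>> for cp in ('cp1252', 'cp1251', 'cp1253', 'cp1255'):
...     print cp, repr(ansi_byte.decode(cp))
...
cp1252 u'\xe0'
cp1251 u'\u0430'
cp1253 u'\u03b0'
cp1255 u'\u05d0'

That is LATIN SMALL LETTER A WITH GRAVE, CYRILLIC SMALL LETTER A, GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS, and HEBREW LETTER ALEF, respectively.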

If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding from locale.getpreferredencoding(). Otherwise, if you know where the file came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
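
For example (output from a hypothetical Western-European Windows box; yours will vary):

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'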

Be careful using codecs.open() to read files on Windows. The docs say: "Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing." This means that your lines will end in \r\n and you will need/want to strip those off.

Putting it all together:

Sample text file, saved with all 4 encoding choices, looks like this in Notepad:

The quick brown fox jumped over the lazy dogs.
àáâãäå

Here is some demo code:

import codecs
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    # Read just enough bytes to cover the longest BOM (3 bytes for UTF-8).
    with open(filepath, 'rb') as f:
        data = f.read(3)
    # FF FE or FE FF: UTF-16; the 'utf-16' codec reads the BOM and
    # picks the right byte order by itself.
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'
    # EF BB BF: UTF-8 with BOM; 'utf-8-sig' strips the BOM while decoding.
    if data == codecs.BOM_UTF8:
        return 'utf-8-sig'
    # No BOM: presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob
    defenc = sys.argv[1]  # fallback "ANSI" encoding; "" means use the locale default
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)                 # note the trailing \r\n
                print lino, repr(line.rstrip('\r\n'))  # with the line ending stripped

and here is the output when run in a Windows "Command Prompt" window using the command \python27\python read_notepad.py "" t1-*.txt

('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5
\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3
\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w
\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e
\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.
\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\
x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\
x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\
x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

Things to be aware of:

(1) "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252, it makes like latin1 (aarrgghh!!); see below

>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f')
, (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>

(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.
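
If you do want to try chardet as a last-resort guess when there is no BOM, here is a minimal sketch (it assumes the third-party chardet package is installed; the 0.5 confidence cut-off is an arbitrary choice of mine):

import chardet

def sniff_with_chardet(filepath):
    # chardet wants raw bytes; feed it the whole file for best results.
    with open(filepath, 'rb') as f:
        raw = f.read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.84}
    if guess['encoding'] and guess['confidence'] > 0.5:
        return guess['encoding']
    return None  # too unsure -- fall back to the locale default or ask the user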

John Machin
3

Notepad saves Unicode files with a byte order mark. This means that the first bytes of the file will be:

  • EF BB BF -- UTF-8
  • FF FE -- "Unicode" (actually UTF-16, little-endian)
  • FE FF -- "Unicode big-endian" (UTF-16, big-endian)

Other text editors may or may not have the same behavior, but if you know for sure Notepad is being used, this will give you a decent heuristic for auto-selecting the encoding. All these sequences are valid in the ANSI encoding as well, however, so it is possible for this heuristic to make mistakes. It is not possible to guarantee that the correct encoding is used.
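
A minimal sketch of that heuristic (the BOM constants come from the standard codecs module; the cp1252 fallback is just a placeholder for whatever legacy encoding you decide on):

import codecs

SIGNATURES = [
    (codecs.BOM_UTF8, 'utf-8-sig'),   # EF BB BF, checked first as the longest signature
    (codecs.BOM_UTF16_LE, 'utf-16'),  # FF FE
    (codecs.BOM_UTF16_BE, 'utf-16'),  # FE FF
]

def sniff_bom(filepath, fallback='cp1252'):
    with open(filepath, 'rb') as f:
        head = f.read(3)  # longest signature is 3 bytes
    for bom, codec in SIGNATURES:
        if head.startswith(bom):
            return codec
    return fallback  # no BOM: assume the legacy "ANSI" encoding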

kindall
  • +1 for pointing out that trying to detect the encoding by looking for a BOM is only a heuristic. Note that guessing at encodings this way is not recommended by the Unicode committee. BOMs are intended only to indicate byte ordering within a known encoding. They are not recommended even just to distinguish between UTF-32, UTF-16 and UTF-8. – bames53 Dec 11 '11 at 23:52
  • @bames53. Your comments seem contrary to the advice given in the Unicode [Byte Order Mark FAQ](http://www.unicode.org/faq/utf_bom.html#BOM) - particularly the section "Q: Where is a BOM useful?". Of course _any_ encoding signature can always lie about the true encoding of a file. But for most practical purposes, the BOM is a reasonably reliable indicator. – ekhumoro Dec 12 '11 at 00:45
  • @bames53: What BOMs were intended for is irrelevant. The question is, given the known behaviour of Notepad, and the knowledge that a file was in fact created by Notepad, what is the best strategy for reading such a file. The fact that such a strategy involves heuristics should go without saying. – John Machin Dec 12 '11 at 01:09
  • @ekhumoro the recommended method for determining encoding is to have it explicitly stated external to the data stream itself. However because there are protocols that don't follow best practices, from Microsoft in particular, the FAQ lists ways of guessing that are generally reliable. The use of a BOM as a signature was developed by people that wouldn't or couldn't use best practices, and sometimes we just have to deal with it. Of course, IMO a better solution is to follow Michael Kaplan's advice and stop using Windows notepad: http://blogs.msdn.com/b/michkap/archive/2010/02/23/9967789.aspx – bames53 Dec 12 '11 at 01:19
  • @JohnMachin No, I think that the fact that the method is only a heuristic should be explicitly stated, because it's important to know that it's not 100% accurate and that there are ways to fool it. As I said above, yes, sometimes we just have to deal with it, but that's not a reason to fail to fully explain or understand what we're doing. – bames53 Dec 12 '11 at 01:23
  • @bames53. I don't think we're really in disagreement. I just thought your initial comment put things a little too strongly. This is Python, after all, so: Practicality Beats Purity ;-) – ekhumoro Dec 12 '11 at 01:44