
I'm working on a Python tool that must be able to open files encoded as either UTF-8 or UTF-16. In Python 3.2, I use the following code to try opening the file as UTF-8 first, falling back to UTF-16 if that raises a UnicodeDecodeError:

def readGridFromPath(self, filepath):
    try:
        self.readGridFromFile(open(filepath, 'r', encoding='utf-8'))
    except UnicodeDecodeError:
        self.readGridFromFile(open(filepath, 'r', encoding='utf-16'))

(readGridFromFile will either run to completion or raise a UnicodeDecodeError.)

However, when I run this code in Python 2.x, I get:

TypeError: 'encoding' is an invalid keyword argument for this function

I see in the docs that Python 2.x's open() doesn't have an encoding keyword. Is there any way around this that will allow me to make my code Python 2.x compatible?

stalepretzel
1 Answer


io.open is a drop-in replacement for your needs, so the code sample you've provided will look as follows in Python 2.x:

import io

def readGridFromPath(self, filepath):
    try:
        self.readGridFromFile(io.open(filepath, 'r', encoding='utf-8'))
    except UnicodeDecodeError:
        self.readGridFromFile(io.open(filepath, 'r', encoding='utf-16'))


io.open is described in detail in the io module documentation. Its prototype is:

io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)

The io module itself was designed as a compatibility layer between Python 2.x and Python 3.x, to ease the transition to Py3k and simplify back-porting and maintenance of existing Python 2.x code.
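To illustrate the cross-version point, here's a minimal round-trip sketch: the same io.open calls run unchanged on Python 2.6+ and Python 3.x (the file name and contents are arbitrary examples):

```python
import io
import os
import tempfile

# An arbitrary sample file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'grid.txt')

# io.open expects unicode text, not bytes, and handles the
# encoding transparently on both Python 2 and Python 3.
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(u'row1,row2')

with io.open(path, 'r', encoding='utf-16') as f:
    print(f.read())  # prints: row1,row2
```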

Also, please note that there is a caveat with codecs.open, as it works in binary mode only:

> Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of `'\n'` is done on reading and writing.

You may also run into the issue of manually detecting and stripping the UTF-8 BOM: codecs.open leaves the UTF-8 BOM inline as a u'\ufeff' character.

toriningen
  • Good call, `io.open` is the better option. However, the disadvantages of `codecs.open` aren't really significant enough to call it "unsuitable", IMHO. – Niklas B. Apr 07 '12 at 22:16
  • By the way, the claim about `codecs.open` not handling BOM correctly is just wrong (I tried it). The thing about it not automatically converting newlines is true, though (but this seems to be the only difference). – Niklas B. Apr 07 '12 at 22:24
  • I've tried that now again — for UTF-16 BE/LE it works pretty fine, but for UTF8 its BOM (EB BB BF) is left in decoded text as u'\ubeff'. I clearly remember I had decoding problems with BOM using `.decode()` on Windows, but I cannot test it now. I've fixed that claim for fairness. – toriningen Apr 07 '12 at 23:08
  • 4
    \ufeff is the BOM (EF BB BF in UTF-8), and the 'utf-8-sig' codec will detect and strip it if present, as will 'utf16' for UTF-16 encoded text. 'utf-16le' and 'utf-16be' assume no BOM and won't remove it. – Mark Tolonen Apr 07 '12 at 23:39
  • @MarkTolonen, hmmm, seems that [russian wikipedia](http://ru.wikipedia.org/wiki/BOM#.D0.9F.D1.80.D0.B5.D0.B4.D1.81.D1.82.D0.B0.D0.B2.D0.BB.D0.B5.D0.BD.D0.B8.D0.B5_.D0.BA.D0.BE.D0.B4.D0.B8.D1.80.D0.BE.D0.B2.D0.BA.D0.B8_byte_order_marks) tricked me over, as it states **EB** BB BF (**235** 187 191) is UTF8 BOM. – toriningen Apr 07 '12 at 23:45
  • @modchan: I see the correct `BOM_UTF8` value on [the corresponding page](https://ru.wikipedia.org/wiki/%D0%9C%D0%B0%D1%80%D0%BA%D0%B5%D1%80_%D0%BF%D0%BE%D1%81%D0%BB%D0%B5%D0%B4%D0%BE%D0%B2%D0%B0%D1%82%D0%B5%D0%BB%D1%8C%D0%BD%D0%BE%D1%81%D1%82%D0%B8_%D0%B1%D0%B0%D0%B9%D1%82%D0%BE%D0%B2) – jfs Jan 21 '16 at 16:15
  • @J.F.Sebastian [here](https://ru.wikipedia.org/w/?title=Маркер_последовательности_байтов&diff=next&oldid=42802415) you can see the vandalizing change from March 21st, 2012, that I [had then fixed](https://ru.wikipedia.org/w/?title=Маркер_последовательности_байтов&diff=next&oldid=42805698) in next edit on April 8th, 2012. – toriningen Jan 23 '16 at 13:55