Python UTF-16 CSV reader

Question

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.

I am using python 2.7.2. CSV files I need to parse are huge size running into several GBs of data.

Answers for John Machin questions below

print repr(open('test.csv', 'rb').read(100))

Output with test.csv having just abc as content

'\xff\xfea\x00b\x00c\x00'

I think csv file got created on windows machine in USA. I am using Mac OSX Lion.

If I use code provided by phihag and test.csv containing one record.

sample test.csv content used. Below is print repr(open('test.csv', 'rb').read(1000)) output

'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'

Code by phihag

import codecs
import csv
with open('test.csv','rb') as f:
      sr = codecs.StreamRecoder(f,codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'),codecs.getreader('utf-16'),codecs.getwriter('utf-16'))      
      for row in csv.reader(sr):
         print row

Output of the above code

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']

expected output is

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']

phihag · Accepted Answer · 2012-02-07T23:29:05.317

At the moment, the csv module does not support UTF-16.

In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line) # do something with the line

In Python 2.x, you can recode the input:

# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line,eol,rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)
    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')

    for row in csv.reader(sr):
        print (row)

open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:

try:
    from io import BytesIO
except ImportError: # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line) # do something with the line

Thanks @phihag for your response. Is there a way to do this without loading file into memory? My csv file is huge. — venky, Feb 07 '12 at 14:53
Try the first method; it will fail with a `UnicodeError` if the stream doesn't. You can also examine the first two bytes of the file; if they are `FE FF` or `FF FE`, that's the BOM. — phihag, Feb 07 '12 at 15:21
While trying StreamReader option by @phihag csv reader seems to read record partially sometimes. When I open the file in vi. I see <85> in the line where it seems to think it is end of record, but there are two other fields after <85> character. Looks like leftover fields are treated as next record — venky, Feb 07 '12 at 15:33
Can you upload a demonstration file somewhere? Without it, I can't reproduce the problem. Also, does the second method fail as well when used on the demonstration file? — phihag, Feb 07 '12 at 15:36
Second method i.e. using BytesIO worked for me with sample file. It is loading the file into memory which I can not do in my case (huge file). @phihag — venky, Feb 07 '12 at 16:19

score 4 · Answer 2 · edited Oct 22 '12 at 15:09

4

The Python 2.x csv module documentation example shows how to handle other encodings.

edited Oct 22 '12 at 15:09

Antony Hatchkins

31,947
10
111
111

answered Feb 08 '12 at 01:48

Mark Tolonen

166,664
26
169
251

1

What the documentation actually says is: "So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs." – Antony Hatchkins Oct 22 '12 at 13:24
@Antony did you read the final example? It recodes to ANY encoding as UTF-8 before passing it to the csv module. – Mark Tolonen Oct 22 '12 at 14:25
Yep, the issue is addressed in just a few lines which do pretty much the same as the code from @phihag's answer. I would quote the example explicitly though - to make reader's life easier :) Downvote removed. – Antony Hatchkins Oct 22 '12 at 15:08
This was in addition to phihag's answer and a gentle RTFM :) – Mark Tolonen Oct 22 '12 at 15:39
Good addition :) Poorly written `csv` module code (utf16 is not THAT horrible and it is one of the defaults for Excel output) and documentation (it is not obvious that the final example deals with both NULs and utf16 as well) is due to Guido wanting everybody to move to python 3.x, I guess. – Antony Hatchkins Oct 22 '12 at 16:39

John Machin · Answer 3 · 2012-02-08T21:40:05.353

I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.

Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:

print repr(open('thefile.csv', 'rb').read(100))

Four possible ways of encoding u'abc'

\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le

If you have any trouble with this step, edit your question to include the results of the above print repr()

Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:

import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()

Other matters:

You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.

The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?

Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net

Update based on 1-line sample data provided.

This confirms my suspicions. Read this. Here's a quote from it:

... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters

This code:

s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))

s3 = u''.join(
    c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
    )
print 's3 repr:', repr(s3)
print 's3:', s3

produces the following (Python 2.7.2 IDLE, Windows 7):

s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to:   U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to:   U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to:   U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I

Which do you think is a more reasonable interpretation of \x96:

SPA i.e. Start of Protected Area (Used by block-oriented terminals.)
or
EN DASH
?

Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.

score -1 · Answer 4 · answered Feb 07 '12 at 15:04

-1

Just open your file with codecs.open like in

import codecs, csv

stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)

And work through your program with unicode strings, as you should do anyway if you are processing text

answered Feb 07 '12 at 15:04

jsbueno

99,910
10
151
209

1

for record in csv.reader(stream): line throws exception UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 77: ordinal not in range(128) – venky Feb 07 '12 at 15:09
This works fine in Python 3.x (although one could just write `open` instead of `codecs.open`), but fails in 2.x because `csv` tries to re-encode the unicode characters it reads from the stream. – phihag Feb 07 '12 at 15:09

Python UTF-16 CSV reader

4 Answers4

Linked

Related