
I have a UTF-16 CSV file which I have to read. The Python csv module does not seem to support UTF-16.

I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GB of data.

Answers to John Machin's questions below.

print repr(open('test.csv', 'rb').read(100))

Output with test.csv containing just abc:

'\xff\xfea\x00b\x00c\x00'

I think the CSV file was created on a Windows machine in the USA. I am using Mac OS X Lion.

If I use the code provided by phihag with test.csv containing one record:

Sample test.csv content used. Below is the output of print repr(open('test.csv', 'rb').read(1000)):

'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'

Code by phihag

import codecs
import csv
with open('test.csv', 'rb') as f:
    sr = codecs.StreamRecoder(f, codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
                              codecs.getreader('utf-16'), codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row

Output of the above code:

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']

The expected output is:

['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
venky

4 Answers


At the moment, the csv module does not support UTF-16.

In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:

# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line) # do something with the line
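If the fields can contain embedded newlines, note that the Python 3 csv documentation also recommends opening the file with newline='' so that newlines inside quoted fields are handled correctly.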

In Python 2.x, you can recode the input:

# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line,eol,rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)
    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')

    for row in csv.reader(sr):
        print (row)
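A note on the design: the Recoder splits records on '\r\n' itself rather than relying on the codecs reader's readline(), because Unicode-aware line splitting also treats characters such as U+0085 (NEL) as line breaks, which is most likely what was cutting the sample record short at the <85> byte in the question.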

open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:

try:
    from io import BytesIO
except ImportError: # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line) # do something with the line
phihag
  • Thanks @phihag for your response. Is there a way to do this without loading file into memory? My csv file is huge. – venky Feb 07 '12 at 14:53
  • how do I know if the file is starting with a BOM? @phihag – venky Feb 07 '12 at 15:15
  • Try the first method; it will fail with a `UnicodeError` if the stream doesn't. You can also examine the first two bytes of the file; if they are `FE FF` or `FF FE`, that's the BOM. – phihag Feb 07 '12 at 15:21
  • While trying the StreamRecoder option by @phihag, the csv reader sometimes seems to read a record only partially. When I open the file in vi I see <85> in the line where it thinks the record ends, but there are two more fields after the <85> character. It looks like the leftover fields are treated as the next record – venky Feb 07 '12 at 15:33
  • Can you upload a demonstration file somewhere? Without it, I can't reproduce the problem. Also, does the second method fail as well when used on the demonstration file? – phihag Feb 07 '12 at 15:36
  • Can I email you the csv sample file? @phihag – venky Feb 07 '12 at 16:08
  • Second method i.e. using BytesIO worked for me with sample file. It is loading the file into memory which I can not do in my case (huge file). @phihag – venky Feb 07 '12 at 16:19

The Python 2.x csv module documentation example shows how to handle other encodings.
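For reference, the recipe at the end of that documentation wraps the input in an iterator that re-encodes each line to UTF-8 before handing it to csv.reader, then decodes the resulting fields back to unicode. A rough sketch of it (paraphrased from memory, so check the docs for the exact code):

import codecs
import csv

class UTF8Recoder:
    """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """CSV reader for a file in the given encoding; yields rows of unicode strings."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, 'utf-8') for s in row]

    def __iter__(self):
        return self

# Hypothetical usage with the UTF-16 file from the question:
# for row in UnicodeReader(open('test.csv', 'rb'), encoding='utf-16'):
#     print row

Be aware that because this goes through the codecs reader's readline(), it can still split records on Unicode line breaks such as U+0085 (NEL), which is the behaviour the question ran into.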

Mark Tolonen
  • What the documentation actually says is: "So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs." – Antony Hatchkins Oct 22 '12 at 13:24
  • @Antony did you read the final example? It recodes to ANY encoding as UTF-8 before passing it to the csv module. – Mark Tolonen Oct 22 '12 at 14:25
  • Yep, the issue is addressed in just a few lines which do pretty much the same as the code from @phihag's answer. I would quote the example explicitly though - to make reader's life easier :) Downvote removed. – Antony Hatchkins Oct 22 '12 at 15:08
  • This was in addition to phihag's answer and a gentle RTFM :) – Mark Tolonen Oct 22 '12 at 15:39
  • Good addition :) Poorly written `csv` module code (utf16 is not THAT horrible and it is one of the defaults for Excel output) and documentation (it is not obvious that the final example deals with both NULs and utf16 as well) is due to Guido wanting everybody to move to python 3.x, I guess. – Antony Hatchkins Oct 22 '12 at 16:39

I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, UTF-16 is effectively a fixed-length (two-byte) encoding, so you can take advantage of that to read fixed-length blocks from your input file without worrying about straddling block boundaries.

Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:

print repr(open('thefile.csv', 'rb').read(100))

Four possible ways of encoding u'abc'

\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le

If you have any trouble with this step, edit your question to include the results of the above print repr().
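If you would rather automate the check, here is a minimal sketch (it assumes the first character of the file is plain ASCII, which is true of the sample above, and a hypothetical file name thefile.csv):

# Rough sniff for the UTF-16 variants listed above.
with open('thefile.csv', 'rb') as f:
    head = f.read(4)
if head[:2] in ('\xff\xfe', '\xfe\xff'):
    enc = 'utf_16'      # BOM present; the utf_16 codec consumes it
elif head[:1] == '\x00':
    enc = 'utf_16_be'   # no BOM, high byte first
else:
    enc = 'utf_16_le'   # no BOM, low byte first
print enc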

Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:

import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
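Assuming the script above is saved as, say, recode_utf16.py (the name is just illustrative), you run it once per file and then point the stock csv module at the UTF-8 copy:

# After recoding, e.g.
#     python recode_utf16.py thefile.csv thefile_utf8.csv utf_16
# the UTF-8 copy can be read with the plain csv module:
import csv
with open('thefile_utf8.csv', 'rb') as f:
    for row in csv.reader(f):
        print row   # fields are UTF-8 encoded byte strings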

Other matters:

You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.

The <85> being treated as end of record is a bit of a worry. It looks like 0x85 is being recognised as NEL (the C1 control code NEXT LINE). There is a strong possibility that the data was originally in some legacy single-byte encoding where 0x85 has a meaning, but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think it means?

Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net

Update based on 1-line sample data provided.

This confirms my suspicions. Read up on the C1 control characters; here's a relevant quote:

... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters

This code:

s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))

s3 = u''.join(
    c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
    )
print 's3 repr:', repr(s3)
print 's3:', s3

produces the following (Python 2.7.2 IDLE, Windows 7):

s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to:   U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to:   U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to:   U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I

Which do you think is a more reasonable interpretation of \x96: SPA, i.e. Start of Protected Area (used by block-oriented terminals), or EN DASH?

Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.

John Machin

Just open your file with codecs.open, like this:

import codecs, csv

stream = codecs.open("yourfile.csv", encoding="utf-16")  # substitute your file name
reader = csv.reader(stream)

Then work with unicode strings throughout your program, as you should do anyway when processing text.

jsbueno
  • The for record in csv.reader(stream): line throws an exception: UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 77: ordinal not in range(128) – venky Feb 07 '12 at 15:09
  • This works fine in Python 3.x (although one could just write `open` instead of `codecs.open`), but fails in 2.x because `csv` tries to re-encode the unicode characters it reads from the stream. – phihag Feb 07 '12 at 15:09