utf-16-le BOM csv files

Question

I'm downloading some CSV files from the Play Store (stats, etc.) and want to process them with Python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see they are utf-16le.

I have some code on Python 2.7 that works on some files and not on others:

import codecs
# ...
fp = codecs.open(dir_n + '/' + file_n, 'r', 'utf-16')
for line in fp:
    pass  # write to mysql db

This works until:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

What is the proper way to do this? I've seen suggestions to re-encode, use the csv module, etc., but the csv module does not handle encoding by itself, so it seems overkill for just dumping to a database.

I don't think the csv module and Unicode play well together. Things have probably improved in Python 3. — Mark Ransom, May 04 '15 at 22:23
Yes, that is why I stated that it does not handle it (they recommend using filter functions, etc.). So, in essence, how can I properly handle these files? — cromestant, May 04 '15 at 22:31
If the tool is fundamentally incapable of handling the job, it's time to find a different tool. I've suggested one possible approach already. — Mark Ransom, May 04 '15 at 23:07
I must be having trouble explaining myself. I meant that the answers I've seen always refer to it, but it would seem that the module is incapable. However, there are some good answers below that I'll be testing. — cromestant, May 05 '15 at 12:24
And it is not that I do not want to use Python 3; I have some other dependent modules that are not yet migrated, and I just lack the time to do it. — cromestant, May 05 '15 at 12:24

score 4 · Accepted Answer · answered May 05 '15 at 02:48

Have you tried codecs.EncodedFile?

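import codecs
import csv

with open('x.csv', 'rb') as f:
    # Transcode UTF-16LE to UTF-8 on the fly; the Python 2 csv module expects byte strings
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]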
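
One caveat, since the files carry a BOM: as far as I know, Python's 'utf-16le' codec does not consume a leading BOM, whereas 'utf-16' uses it to detect the byte order and then drops it. A minimal sketch of the difference, using a made-up header line (the column names are hypothetical, not taken from the real export):

```python
# -*- coding: utf-8 -*-
import codecs

# Hypothetical sample: a BOM followed by UTF-16LE text, as an export file might start.
raw = codecs.BOM_UTF16_LE + u'Date,Installs\n'.encode('utf-16-le')

with_bom = raw.decode('utf-16-le')   # the BOM survives as u'\ufeff' at the start
without_bom = raw.decode('utf-16')   # the BOM selects the byte order, then is dropped

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

If the first cell of the first row comes back with a stray u'\ufeff', passing 'utf-16' instead of 'utf-16le' as the file encoding should avoid it.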

score 3 · Answer 2 · answered May 05 '15 at 02:15

What is the proper way to do this?

The proper way is to use Python 3, in which Unicode support is vastly more rational.
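
For comparison, a minimal Python 3 sketch (it will not run on 2.7). The file name and sample rows are made up; the point is only that open() takes the encoding directly and csv.reader then yields text strings:

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for a downloaded stats export.
path = os.path.join(tempfile.mkdtemp(), 'installs.csv')
with open(path, 'w', encoding='utf-16', newline='') as fp:
    csv.writer(fp).writerows([['Date', 'Note'], ['2015-05-04', 'Instalaci\xf3n']])

# 'utf-16' reads the BOM to pick the byte order; newline='' is what the csv docs recommend.
with open(path, encoding='utf-16', newline='') as fp:
    rows = list(csv.reader(fp))

print(rows)
```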

As a work-around, if you are allergic to Python 3 for some reason, the best compromise is to wrap csv.reader(), like so:

import codecs
import csv

def to_utf8(fp):
    for line in fp:
        yield line.encode("utf-8")

def from_utf8(fp):
    for line in fp:
        yield [column.decode('utf-8') for column in line]

with codecs.open('utf16le.csv','r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        #"line" is a list of unicode strings
        #write to mysql db
        print line

Robᵩ

tried this, it yields the same error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128) – cromestant May 05 '15 at 13:16
What is the entire traceback, with line numbers included? – Robᵩ May 05 '15 at 13:57
Traceback (most recent call last): File "bime.py", line 138, in main() File "bime.py", line 126, in main daily_user_uninstalls=parsed[10]) UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128) – cromestant May 05 '15 at 14:15
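
A hedged reading of that traceback: it is a UnicodeEncodeError rather than a decode error, and it is raised at line 126 of bime.py rather than inside the csv loop. That suggests the file was read fine and the failure happens later, when a unicode value is implicitly converted to a byte string with the default 'ascii' codec. What line 126 actually calls is not shown, so this is an assumption. The sketch below only reproduces the same error class and shows an explicit UTF-8 encode; the sample value is made up.

```python
# -*- coding: utf-8 -*-
value = u'Instalaci\xf3n'  # made-up cell containing u'\xf3', the character from the error

try:
    value.encode('ascii')  # what an implicit unicode -> str conversion does on Python 2
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

encoded = value.encode('utf-8')  # explicit encode: u'\xf3' becomes the bytes 0xC3 0xB3
print(repr(encoded))
```

Whether the database layer wants UTF-8 bytes or unicode objects depends on the driver and the connection charset, which the thread does not say.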
