I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.
I replaced csv with unicodecsv, but I still get the same error. Here's a simplified version of my code:
from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL
data = (
'first_name,last_name,email\r'
'Elmer,Fudd,elmer@looneytunes.com\r'
'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)
unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)
class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True
rows = DictReader(unicode_data, dialect=CustomDialect)
for row in rows:
    print row
If I replace StringIO with BytesIO, the encoding works, but I can no longer pass the newline argument, and then I get:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks
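One possible direction, sketched here under assumptions rather than as a confirmed fix: the sample data ends each row with a bare '\r', which is what triggers the new-line error once BytesIO is used without universal-newline handling. Normalizing the line endings on the raw bytes before wrapping them in BytesIO would sidestep the missing newline argument. The helper name normalize_newlines is hypothetical, and this naive replacement would also rewrite any '\r' that appears inside a quoted field.

```python
from io import BytesIO


def normalize_newlines(raw_bytes):
    # Hypothetical helper: turn '\r\n' and bare '\r' line endings into '\n'.
    # Replace '\r\n' first so it does not become two newlines.
    return raw_bytes.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


data = (
    b'first_name,last_name,email\r'
    b'Elmer,Fudd,elmer@looneytunes.com\r'
    b'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)

# The stream stays in bytes, so decoding is left to the CSV reader.
byte_stream = BytesIO(normalize_newlines(data))

# Assumption: the stream is then passed to the reader as bytes, e.g.
#   rows = unicodecsv.DictReader(byte_stream, encoding='utf-8')
```

The point of the sketch is to keep the input as bytes, which is what BytesIO expects, and to do the newline handling by hand instead of relying on StringIO's newline parameter.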