
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:

import csv
import json
from StringIO import StringIO

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()

csv_reader = csv.DictReader(StringIO(file_contents))
json_data = json.dumps([x for x in csv_reader])

This produces the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Apparently, there is a problem handling the non-breaking spaces (byte 0xa0) in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:

Traceback (most recent call last):
  File ".../web2py/gluon/restricted.py", line 212, in restricted
    exec ccode in environment
  File ".../controllers/default.py", line 2345, in <module>
  File ".../web2py/gluon/globals.py", line 194, in <lambda>
    self._caller = lambda f: f()
  File ".../web2py/gluon/tools.py", line 3021, in f
    return action(*a, **b)
  File ".../controllers/default.py", line 697, in generate_vis
    request.vars.json = json.dumps(list(csv_reader))
  File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

Lamps1829

3 Answers

The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).

A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't really want to rely on json guessing a particular encoding.
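For illustration, here is a minimal sketch of that quick fix under Python 2 (the sample bytes are assumed; 0xa0 is a non-breaking space in Windows-1252):

# Hypothetical Windows-1252 bytes: en dash (\x96), NBSP (\xa0), e-acute (\xe9)
raw = 'City\x96Centre,\xa0Caf\xe9\r\n'

# Decode from the real source encoding, then re-encode as UTF-8
utf8 = raw.decode('windows-1252').encode('utf-8')
print(repr(utf8))  # 'City\xe2\x80\x93Centre,\xc2\xa0Caf\xc3\xa9\r\n'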

Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:

class UnicodeDictReader(csv.DictReader):
    """A DictReader that decodes every byte-string cell to unicode."""
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding
    def next(self):
        # Decode both the header keys and the values as each row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
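As a quick check (a sketch with assumed sample data, continuing from the class above), every field now comes out as unicode, so json.dumps no longer has to guess an encoding:

from StringIO import StringIO
import json

sample = 'name,city\r\nJos\xe9,M\xfcnchen\r\n'  # assumed Windows-1252 bytes
reader = UnicodeDictReader(StringIO(sample), 'windows-1252')
print(json.dumps(list(reader)))
# [{"city": "M\u00fcnchen", "name": "Jos\u00e9"}]  (key order may vary)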

"it's not known in advance what sort of encoding will come up"

Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
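If you do have to guess, the third-party chardet library can make a statistical guess (an assumption about your setup; it is not in the standard library, and its guesses can be wrong):

import chardet

# detect() returns e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
guess = chardet.detect(file_contents)
encoding = guess['encoding'] or 'ascii'  # fall back if detection fails entirely
csv_reader = UnicodeDictReader(StringIO(file_contents), encoding)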

bobince

Try replacing your final line with

json_data = json.dumps([x.encode('utf-8') for x in csv_reader])
ChrisProsser
  • The specific character that's causing the issue in this particular case is '\xa0'; encode('utf-8') produces an error when encountering it. – Lamps1829 Jul 07 '13 at 19:23
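For context, a quick Python 2 sketch of why that fails: calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII, which is exactly where '\xa0' trips it up. (Note also that DictReader yields dicts, which have no .encode method, so each value would need to be encoded individually.)

>>> '\xa0'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)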

Running unidecode over the file contents seems to do the trick:

from isounidecode import unidecode

...

file_contents = unidecode(file_stream.read())

...
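For illustration, this is roughly what that transliteration does (a sketch using the standalone unidecode package, which exposes a function of the same name; isounidecode's handling of raw byte strings may differ):

from unidecode import unidecode

# U+00A0 (the character behind byte 0xa0) is folded to a plain space,
# and accented characters become ASCII look-alikes
print(unidecode(u'Caf\xe9\xa0Menu'))  # -> 'Cafe Menu'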

Thanks, everyone!

Lamps1829
  • That's replacing all non-ASCII characters with mangled best-fit ASCII versions - are you sure you want to do that? – bobince Jul 08 '13 at 09:15
  • You make a good point. It may not be a universal solution, but it works for my purposes, and it seems to be better than some other options at dealing with a multitude of cases in such a way that no error is produced (as opposed to encode(), which gets tripped up by '\xa0', for example). – Lamps1829 Jul 08 '13 at 15:49