
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:

import csv
import json
from StringIO import StringIO

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()

csv_reader = csv.DictReader(StringIO(file_contents))
json_data = json.dumps([x for x in csv_reader])

This produces the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Apparently, there is a problem handling the non-breaking spaces (byte 0xa0) in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:

Traceback (most recent call last):
  File ".../web2py/gluon/restricted.py", line 212, in restricted
    exec ccode in environment
  File ".../controllers/default.py", line 2345, in <module>
  File ".../web2py/gluon/globals.py", line 194, in <lambda>
    self._caller = lambda f: f()
  File ".../web2py/gluon/tools.py", line 3021, in f
    return action(*a, **b)
  File ".../controllers/default.py", line 697, in generate_vis
    request.vars.json = json.dumps(list(csv_reader))
  File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

Lamps1829

3 Answers

The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).

A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't really want to rely on json guessing a particular encoding.
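For illustration, here is a minimal sketch of that quick fix under Python 2 (the sample bytes are assumed; 0xa0 is a non-breaking space in Windows-1252):

# Hypothetical Windows-1252 bytes: en dash (\x96), NBSP (\xa0), e-acute (\xe9)
raw = 'City\x96Centre,\xa0Caf\xe9\r\n'

# Decode from the real source encoding, then re-encode as UTF-8
utf8 = raw.decode('windows-1252').encode('utf-8')
print(repr(utf8))  # 'City\xe2\x80\x93Centre,\xc2\xa0Caf\xc3\xa9\r\n'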

Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:

class UnicodeDictReader(csv.DictReader):
    """A DictReader that decodes every byte-string cell to unicode."""
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding
    def next(self):
        # Decode both the header keys and the values as each row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
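As a quick check (a sketch with assumed sample data, continuing from the class above), every field now comes out as unicode, so json.dumps no longer has to guess an encoding:

from StringIO import StringIO
import json

sample = 'name,city\r\nJos\xe9,M\xfcnchen\r\n'  # assumed Windows-1252 bytes
reader = UnicodeDictReader(StringIO(sample), 'windows-1252')
print(json.dumps(list(reader)))
# [{"city": "M\u00fcnchen", "name": "Jos\u00e9"}]  (key order may vary)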

"it's not known in advance what sort of encoding will come up"

Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
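If you do have to guess, the third-party chardet library can make a statistical guess (an assumption about your setup; it is not in the standard library, and its guesses can be wrong):

import chardet

# detect() returns e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
guess = chardet.detect(file_contents)
encoding = guess['encoding'] or 'ascii'  # fall back if detection fails entirely
csv_reader = UnicodeDictReader(StringIO(file_contents), encoding)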

bobince

Try replacing your final line with

json_data = json.dumps([x.encode('utf-8') for x in csv_reader])
ChrisProsser
  • The specific character that's causing the issue in this particular case is '\xa0'; encode('utf-8') produces an error when encountering it. – Lamps1829 Jul 07 '13 at 19:23
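For context, a quick Python 2 sketch of why that fails: calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII, which is exactly where '\xa0' trips it up. (Note also that DictReader yields dicts, which have no .encode method, so each value would need to be encoded individually.)

>>> '\xa0'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)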

Running unidecode over the file contents seems to do the trick:

from isounidecode import unidecode

...

file_contents = unidecode(file_stream.read())

...
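For illustration, this is roughly what that transliteration does (a sketch using the standalone unidecode package, which exposes a function of the same name; isounidecode's handling of raw byte strings may differ):

from unidecode import unidecode

# U+00A0 (the character behind byte 0xa0) is folded to a plain space,
# and accented characters become ASCII look-alikes
print(unidecode(u'Caf\xe9\xa0Menu'))  # -> 'Cafe Menu'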

Thanks, everyone!

Lamps1829
  • That's replacing all non-ASCII characters with mangled best-fit ASCII versions - are you sure you want to do that? – bobince Jul 08 '13 at 09:15
  • You make a good point. It may not be a universal solution, but it works for my purposes, and it seems to be better than some other options at dealing with a multitude of cases in such a way that no error is produced (as opposed to encode(), which gets tripped up by '\xa0', for example). – Lamps1829 Jul 08 '13 at 15:49