Error when using the unicode function in the gspread wrapper. Potentially a bug

Question

Calling the unicode function on the following string raises an error:

>>> unicode('All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 68: ordinal not in range(128)

When I check position 68, it appears to be the apostrophe (’):

>>> str='All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.'
>>> str[62:75]
' haven\xe2\x80\x99t us'
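
To make the slice above easier to read: the three bytes `\xe2\x80\x99` are the UTF-8 encoding of a single character, the curly apostrophe. The following is a minimal sketch that checks this; the variable names are illustrative only, and the byte literal is written so that the snippet should run under both Python 2 and Python 3.

```python
import unicodedata

# The bytes from the slice above, written as an explicit byte string.
raw = b'haven\xe2\x80\x99t'

# Decoding with the right codec turns the three bytes into one character.
text = raw.decode('utf-8')
quote = text[5]

name = unicodedata.name(quote)   # official Unicode name of that character
code_point = hex(ord(quote))     # its code point as a hex string

# The ASCII codec cannot handle byte 0xe2, which is what the traceback reports.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(name, code_point, ascii_ok)
```

So the character is not a plain ASCII apostrophe, which is why the implicit ASCII decode fails at that position.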

Is there a way to deal with this issue? I found this bug in the gspread wrapper, in the file models.py on line 426. Here is the line:

425 cell_elem = feed.find(_ns1('cell'))
426 cell_elem.set('inputValue', unicode(val))
427 uri = self._get_link('edit', feed).get('href')

So when I try to update a cell with a value (a string in this case), the gspread wrapper tries to convert it to unicode but cannot because of the apostrophe. This is potentially a bug. How should I deal with it? Thanks for the help.

@PadraicCunningham you mean instead of the apostrophe? I am pulling data from a database. I could probably add a little code to the models.py file to replace `’` with `"` — Koba, Jul 30 '14 at 18:03
Can you edit the string in question before the error is raised? — dwitvliet, Jul 30 '14 at 18:15
@Banana I actually solved this issue by adding some code to the gspread wrapper. Namely `if isinstance(val, basestring): val = re.sub(r'(’)','',val)` — Koba, Jul 30 '14 at 18:22
@Koba A better solution, if you want to keep all of your string: add `val.decode('utf-8')` right before `unicode(val)`, then sometime later do `val.encode('utf-8')`. — dwitvliet, Jul 30 '14 at 18:24

score 0 · Answer 1 · answered Jul 30 '14 at 18:58

There's no need to replace the character. Just properly decode the encoded string to unicode:

>>> s = 'All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.'
>>> s.decode('utf-8')
u'All but Buitoni are using Pinterest buffers and Pratt & Lamber haven\u2019t used it for a month so I\u2019ll check on this.'  # unicode object

You need to tell Python which encoding your str object uses in order to convert it to unicode, rather than calling unicode(some_str) directly, which falls back to the default ASCII codec. In this case, your string is encoded as UTF-8. This approach scales better than replacing characters, because you won't need a special case for every non-ASCII character that exists in the DB.
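
As a rough sketch of that idea, a small helper could decode byte strings with an explicit encoding before they reach a call like `unicode(val)`. Note that `to_text` is a hypothetical name and not part of gspread, and the UTF-8 default is an assumption about your data. The code is written to run under both Python 2 and Python 3.

```python
def to_text(val, encoding='utf-8'):
    """Return val as a text (unicode) object.

    Hypothetical helper, not part of gspread. The default encoding
    is an assumption about how the incoming bytes were encoded.
    """
    text_type = type(u'')          # unicode on Python 2, str on Python 3
    if isinstance(val, bytes):     # bytes is an alias for str on Python 2
        return val.decode(encoding)
    if isinstance(val, text_type):
        return val
    return text_type(val)          # numbers and other non-string values


value = to_text(b'haven\xe2\x80\x99t used it')
print(repr(value))
```

Values that are already unicode pass through unchanged, so the helper is safe to call more than once on the same value.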

IMO, the best practice for dealing with unicode in Python is this:

- Decode strings to unicode from external sources (like a DB) as early as possible.
- Use them as unicode objects internally.
- Encode them back to byte strings only when you need to send them to an external location (a file, a DB, a socket, etc.).
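
The three steps above can be sketched as follows. This is only an illustration: `raw_from_db` stands in for bytes returned by a real database driver, and UTF-8 is assumed for both the incoming and outgoing encoding. It should run under both Python 2 and Python 3.

```python
# Stand-in for a byte string read from an external source (assumed UTF-8).
raw_from_db = b'I\xe2\x80\x99ll check on this.'

# 1. Decode at the boundary, as early as possible.
text = raw_from_db.decode('utf-8')

# 2. Work with unicode internally; string operations now see characters,
#    not individual bytes.
shouted = text.upper()

# 3. Encode only when the value leaves the program (file, DB, socket).
outgoing = shouted.encode('utf-8')

print(len(raw_from_db), len(text))
```

The length difference between `raw_from_db` and `text` shows why step 2 matters: slicing or counting on the raw bytes can cut a multi-byte character in half, while the decoded text treats the apostrophe as one character.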

I'd also recommend checking out this slide deck, which gives a really good overview of how to deal with unicode in Python.
