Calling the unicode function on the following string raises an error:

>>> unicode('All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 68: ordinal not in range(128)

When I check position 68, it appears to be the apostrophe (’):

>>> str='All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.'
>>> str[62:75]
' haven\xe2\x80\x99t us'

Is there a way to deal with this issue? I ran into it in the gspread wrapper, in the file models.py on line 426. Here is the line in context:

425 cell_elem = feed.find(_ns1('cell'))
426 cell_elem.set('inputValue', unicode(val))
427 uri = self._get_link('edit', feed).get('href')

So when I try to update a cell with a value (a string in this case), the gspread wrapper tries to convert it to unicode but fails because of the apostrophe. This may be a bug in the wrapper. How can I deal with this issue? Thanks for the help.

Koba

1 Answer

There's no need to replace the character. Just properly decode the encoded string to unicode:

>>> s = 'All but Buitoni are using Pinterest buffers and Pratt & Lamber haven’t used it for a month so I’ll check on this.'
>>> s.decode('utf-8')
u'All but Buitoni are using Pinterest buffers and Pratt & Lamber haven\u2019t used it for a month so I\u2019ll check on this.'  # unicode object

To convert a str object to unicode, you need to tell Python which encoding the str uses. Calling unicode(some_str) with no encoding argument falls back to the ASCII codec, which fails on any byte above 127. In this case, your string is encoded with UTF-8. This approach scales better than trying to replace characters, because you won't need a special case for every non-ASCII character that exists in the DB.
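For the gspread case in the question, one option is to decode the value yourself before handing it to the library, so that the unicode(val) call on line 426 receives a unicode object and has nothing left to decode. Below is a minimal sketch of such a helper. The name to_text and the UTF-8 default are assumptions for illustration, not part of gspread; the version check is only there so the same snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import sys

# Hypothetical helper, not part of gspread.
# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) object.

    Byte strings are decoded with `encoding` (assumed UTF-8 here);
    anything else is converted with the text type's constructor.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return text_type(value)


raw = b"haven\xe2\x80\x99t"   # the UTF-8 bytes from the question
decoded = to_text(raw)        # safe to pass on; no implicit ASCII decode
```

Passing the decoded value to the wrapper should avoid the implicit ASCII decode, assuming the source bytes really are UTF-8.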

IMO, the best practice for dealing with unicode in Python is this:

  1. Decode byte strings from external sources (like a DB) to unicode as early as possible.
  2. Use them as unicode objects internally.
  3. Encode them back to byte strings only when you need to send them to an external location (a file, a DB, a socket, etc.)
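The three steps above can be sketched roughly as follows. This is only an illustration: the in-memory buffer stands in for a real file, DB, or socket, and UTF-8 is assumed on both sides.

```python
# -*- coding: utf-8 -*-
import io

# 1. Decode at the boundary: bytes arriving from an external source.
incoming = b"haven\xe2\x80\x99t"      # UTF-8 encoded (assumed)
text = incoming.decode("utf-8")       # now a unicode object

# 2. Work with unicode internally: operations see characters, not bytes.
shouted = text.upper()

# 3. Encode only when the data leaves the program.
sink = io.BytesIO()                   # stand-in for a file or socket
sink.write(shouted.encode("utf-8"))
```

Note that the decoded text has 7 characters while the incoming byte string has 9 bytes, which is why slicing the raw str in the question exposed the three bytes of the apostrophe.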

I'd also recommend checking out this slide deck, which gives a really good overview of how to deal with unicode in Python.

dano