Why can u'\xe5' be decoded but not '\xe5'?

Question

This is flabbergasting and extremely frustrating, please help.

>>> a1 = '\xe5'   # type <str>
>>> a2 = u'\xe5'  # type <unicode>
>>> ord(a1)
229
>>> ord(a2)
229
>>> print a2.encode('utf-8')
å
>>> print a1.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

If a1 and a2 have the same value, why can't both be encoded?

I have to use an external API that returns unicode data in the a1 form, which makes it useless. Python apparently insists that <str> typed strings must only contain ASCII chars or it refuses to encode them. It completely breaks my application.

Did you try decoding it using the charset it was encoded with? — Ignacio Vazquez-Abrams, Apr 23 '17 at 20:35
Hmm... It turns out the data is encoded as latin-1. I cannot rationalize the correct guess based on any technical detail I can find. I have no idea how Python, or the terminal, or whatever, decided to use latin-1. I just made a lucky guess. — Klas Lindberg, Apr 23 '17 at 21:05
It didn't. It was encoded that way by whatever generated it. — Ignacio Vazquez-Abrams, Apr 23 '17 at 21:07

score 3 · Answer 1 · answered Apr 23 '17 at 20:38

You can only encode Unicode strings. If you call encode on a byte string, Python first tries to decode it to Unicode using the default encoding (ASCII), hence the UnicodeDecodeError. (Note that this confusing behaviour only occurs in Python 2; it has been removed in Python 3.)
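
As a rough sketch of what that implicit step amounts to, the decode can be written out explicitly and given the right codec. The variable names are illustrative, and latin-1 is an assumption about the data's real encoding, taken from the comments on the question. It is written so that it should run on both Python 2 and Python 3:

```python
# Byte string as it might arrive from the external API (illustrative value).
raw = b'\xe5'

# This mirrors the step Python 2 performs implicitly inside str.encode():
# decoding with the default 'ascii' codec, which rejects bytes >= 0x80.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding explicitly with the codec the data was actually written in
# (assumed here to be latin-1) yields a Unicode string that can be encoded.
text = raw.decode('latin-1')
utf8_bytes = text.encode('utf-8')
```

The point of the sketch is only that the decode has to happen somewhere; doing it yourself lets you choose the codec instead of getting ASCII by default.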

Daniel Roseman

Isn't there a way to set the byte string's encoding? The data comes from a terminal that is running with LANG=en_US.utf8. – Klas Lindberg Apr 23 '17 at 20:55
Even better: Is there no way to cast the byte string into the unicode type without running any conversion? The arrays are byte exact copies, after all. – Klas Lindberg Apr 23 '17 at 20:56

Attie · Answer 2 · 2017-04-23T21:06:07.107

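In python2, a plain string literal is a byte string (str), and Python falls back to the ASCII codec whenever it has to convert one to Unicode implicitly. In python3, a plain string literal is a Unicode string.

ASCII characters may only have a value between 0 and 127 inclusive. Unicode code points, however, may have a much higher value.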

python2:

>>> a = '\x7f'
>>> a.encode('utf-8')
'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

python3:

>>> a = '\x7f'
>>> a.encode('utf-8')
b'\x7f'
>>> a = '\x80'
>>> a.encode('utf-8')
b'\xc2\x80'

The reason that this works in python2 with the u prefix is because you are explicitly stating that "this is a Unicode string".

It might be worth reading up for a more in-depth understanding of using Unicode in python2:

To make use of the (broken) API, it would be best to convert the returned string into a bytearray, but note, this will not work in python3.

>>> a = '\xe5'
>>> b = bytearray(a)
>>> b[0]
229

Remember that the single byte \xe5 is not valid UTF-8 on its own. To store the character U+00E5 (å) in a UTF-8 encoded string, you'd need to store two bytes: 0xC3 0xA5.
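
A short check of those byte values, as a sketch that should behave the same on python2 and python3 (the names are only for illustration):

```python
char = u'\xe5'  # the Unicode code point U+00E5

# Encode the same character with two different codecs.
utf8 = char.encode('utf-8')
latin1 = char.encode('latin-1')

# bytearray gives the integer value of each byte on both Python versions.
utf8_values = list(bytearray(utf8))      # two bytes in UTF-8
latin1_values = list(bytearray(latin1))  # one byte in latin-1
```

So the byte 0xE5 seen in the question matches the latin-1 form of the character, not the UTF-8 form.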

GIZ · Answer 3 · 2017-04-23T21:12:52.983

Let me break your confusion down into pieces. Let's start with the distinction between str and unicode. In Python 2.X:

str is a string of 8-bit characters (1 byte each) that prints as ASCII whenever possible. str is really a sequence of bytes and is the equivalent of bytes in Python 3.X. There's no encoding attached to a str.
unicode is a string of Unicode code points.

Second, here is what encoding means according to the Python documentation:

"The rules for translating a Unicode string into a sequence of bytes are called an encoding."

Then ask yourself this question: does it make sense to encode a str, which is already a sequence of bytes? The answer is no. It does, however, make sense to encode unicode. Why? Because it's a string of Unicode code points (e.g., U+00E4).
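
To make the direction of each operation concrete, here is a minimal sketch of a helper that decodes only when it is handed bytes. The name to_text and the latin-1 default are hypothetical choices for illustration, not part of any library; the code is meant to run on both Python 2.X and 3.X:

```python
def to_text(value, encoding='latin-1'):
    """Return a Unicode string (hypothetical helper for illustration).

    Bytes are decoded with the given encoding; values that are already
    Unicode strings are returned unchanged.
    """
    # On Python 2.X, `bytes` is an alias for `str`, so this check
    # matches plain byte strings on both major versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


from_bytes = to_text(b'\xe5')                # bytes -> decode -> unicode
from_text = to_text(u'\xe5')                 # already unicode, untouched
from_utf8 = to_text(b'\xc3\xa5', 'utf-8')    # same character, UTF-8 input
```

Once everything is Unicode, calling encode('utf-8') on the result is the step that makes sense.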

score 0 · Answer 4 · answered Apr 23 '17 at 23:46

Ignacio's suggestion to decode the byte string from its actual encoding (not ascii, but what?) got me to try with latin-1 even though I think it should have been utf-8. That worked!
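
For what it's worth, here is a small sketch of why latin-1 accepts this data while utf-8 does not (the names are just for illustration; it should run on Python 2.7 and Python 3):

```python
# Every possible byte value, 0x00 through 0xFF, as one byte string.
all_bytes = bytes(bytearray(range(256)))

# latin-1 maps each byte to the code point with the same number,
# so decoding never fails and encoding gives back the original bytes.
decoded = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in decoded]
round_trip = decoded.encode('latin-1')

# A lone 0xE5 byte is the start of a multi-byte UTF-8 sequence with
# nothing after it, so the utf-8 codec rejects it.
try:
    b'\xe5'.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

This also means latin-1 will happily decode data that was really written in some other single-byte encoding, so it working is not proof that the source used latin-1.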

I get the data from the Python 2.7 curses module. My best guess is the problem is in there somewhere. The terminal's encoding is utf-8, but ok, it works now.
