UTF-8 encoding and decoding in Python

Question

I have a UTF-8 string piped from Java to Python.

The end result is

'\xe0\xb8\x9a\xe0\xb8\x99'

Hence for example

a = '\xe0\xb8\x9a\xe0\xb8\x99'

a.decode('utf-8')

gives me the result

u'\u0e1a\u0e19'

However, what I am curious about is this: since the bytes are piped in as UTF-8, why would the result be

'\xe0\xb8\x9a\xe0\xb8\x99'

instead of u'\u0e1a\u0e19'?

If I were to encode u'\u0e1a\u0e19', I would get back '\xe0\xb8\x9a\xe0\xb8\x99'.

So what is the inherent difference between these two, and how do I know when to use decode and when to use encode?

metatoaster · Accepted Answer · 2015-03-19T01:38:18.827

"UTF8 String" is not a sufficient description of what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a Unicode string.

Python 2's unicode type and Python 3's str type each represent a string of Unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A and U+0E19, which in human terms are rendered as บน.

As for explaining the encode and decode calls, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as UTF-8 encoded input to get back the Unicode code points it represents (which is u'\u0e1a\u0e19'). Calling encode on that string of Unicode code points turns it back into a sequence of bytes (the str type in Python 2, the bytes type in Python 3), which gives you '\xe0\xb8\x9a\xe0\xb8\x99' again.
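The round trip described above can be sketched as follows. This is a minimal illustration that assumes Python 3, where the byte string has to be written with a b prefix; in Python 2 the plain '...' literal from the question is already a byte string.

```python
# Python 3 sketch: bytes literals need the b prefix here.
raw = b'\xe0\xb8\x9a\xe0\xb8\x99'    # the six bytes received from Java

text = raw.decode('utf-8')           # bytes -> Unicode code points
print(ascii(text))                   # '\u0e1a\u0e19'
print(len(raw), len(text))           # 6 2 (six bytes, two code points)

round_trip = text.encode('utf-8')    # Unicode code points -> bytes
print(round_trip == raw)             # True
```

A rule of thumb that follows from this: decode at the point where bytes enter the program, and encode at the point where text leaves it.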

Of course, you can encode those Unicode code points into other encodings such as UTF-16, which on little-endian platforms results in the bytes '\xff\xfe\x1a\x0e\x19\x0e' (the leading '\xff\xfe' is the byte order mark), or encode them into a non-Unicode encoding. As this looks like Thai, we can use the iso8859-11 encoding, which produces the bytes '\xba\xb9' - but this is not portable, as it will only be shown as Thai on systems configured for that particular encoding. This is one of the reasons why Unicode was invented: these bytes '\xba\xb9' could be decoded using iso8859-1, which renders them as º¹, or using iso8859-11, which renders them as บน.
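A short sketch of the same two code points in other encodings, again assuming Python 3. It uses the explicit 'utf-16-le' codec and adds the byte order mark by hand so the result does not depend on the platform's byte order; the variable names are only illustrative.

```python
import codecs

text = '\u0e1a\u0e19'  # the two Thai code points U+0E1A U+0E19

# UTF-16, little-endian, without and then with the byte order mark
utf16_le = text.encode('utf-16-le')
with_bom = codecs.BOM_UTF16_LE + utf16_le
print(utf16_le)   # b'\x1a\x0e\x19\x0e'
print(with_bom)   # b'\xff\xfe\x1a\x0e\x19\x0e'

# A non-Unicode, Thai-specific encoding: one byte per character
thai = text.encode('iso8859-11')
print(thai)       # b'\xba\xb9'

# The same two bytes mean different things under different encodings
print(thai.decode('iso8859-11') == text)  # True
misread = thai.decode('iso8859-1')
print(ascii(misread))                     # '\xba\xb9', i.e. º¹
```

The bytes alone do not say which encoding produced them, so the receiver has to be told (or has to agree in advance) which one to use when decoding.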

In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF-8 encoding of the Unicode code points written as u'\u0e1a\u0e19' in Python syntax. Raw bytes (coming over the wire or read from a file) are generally not in the form of Unicode code points and must be decoded into them. Unicode code points are not an encoding, and when sent across the wire (or written to a file) they must be encoded into some byte representation, which in many cases is UTF-8 as it has the greatest portability.
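For the file case, here is a minimal sketch assuming Python 3, where open() in text mode does the encoding and decoding for you. The file name thai.txt and the temporary directory are only for illustration.

```python
import os
import tempfile

text = '\u0e1a\u0e19'  # Unicode code points, not yet bytes

# Hypothetical file location, used only for this example
path = os.path.join(tempfile.mkdtemp(), 'thai.txt')

# Text mode: the str is encoded to UTF-8 as it is written
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Binary mode: shows the raw bytes that actually ended up on disk
with open(path, 'rb') as f:
    on_disk = f.read()
print(on_disk)             # b'\xe0\xb8\x9a\xe0\xb8\x99'

# Text mode again: the bytes are decoded back to code points on read
with open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()
print(read_back == text)   # True
```

Opening the file in binary mode and writing the already-encoded bytes would produce the same file contents; either way, an editor that reads the file as UTF-8 should display บน.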

Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

score 2 · Answer 2 · answered Mar 19 '15 at 01:14

'\xe0\xb8\x9a\xe0\xb8\x99' is simply a series of bytes. You have chosen to interpret that as UTF-8, and when you do, you can decode it into a series of Unicode characters, U+0E1A and U+0E19.

The sequence U+0E1A, U+0E19 can be represented as u'\u0e1a\u0e19', but in some sense that representation is as arbitrary as '\xe0\xb8\x9a\xe0\xb8\x99'. It is "natural", which is why Python prints them that way, but it's inefficient, which is why there are various other encoding schemes, including UTF-8.

In fact, it's slightly misleading for me to say "'\xe0\xb8\x9a\xe0\xb8\x99' is a series of bytes." It is the default representation of a series of bytes, two hundred twenty-four, followed by one hundred eighty-four, and so on.
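To make the "two hundred twenty-four, followed by one hundred eighty-four" point concrete, here is a small sketch assuming Python 3, where iterating over a bytes object yields plain integers:

```python
raw = b'\xe0\xb8\x9a\xe0\xb8\x99'

# Each byte is just a number from 0 to 255; the \x.. escapes are
# only how Python chooses to display them.
values = list(raw)
print(values)                 # [224, 184, 154, 224, 184, 153]

# Building the bytes back from the numbers gives the same object
print(bytes(values) == raw)   # True
```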

Python has a notion of a series of bytes, and it has a separate notion of a series of Unicode characters. encode and decode are one way of mapping between those two notions.

Does that help?

Michael Lorton

..and even "two hundred twenty-four" is a *decimal representation* of the binary representation 11100000, which is just a *binary representation* of some electrons being pushed through some doped silicon, which is just a *standard model representation* of our somewhat-tenuous understanding of subatomic particles, which something something string theory. – roippi Mar 19 '15 at 01:30
@Malvolio So when do I use decode and encode? Say I was to write this string to a file. Do I have to encode u'\u0e1a\u0e19' as UTF-8, or would writing \xe0\xb8\x9a\xe0\xb8\x99 to a file show me the corresponding UTF-8 character บน in the file? – aceminer Mar 19 '15 at 01:30
@aceminer I expanded my answer significantly to answer your question. – metatoaster Mar 19 '15 at 01:38
