"UTF8 string" is an insufficient description of what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a Unicode string. Python 2's unicode type and Python 3's str type represent a string of Unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A U+0E19, which in human terms is rendered as บน.
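To make the "string of code points" idea concrete, here is a minimal sketch in Python 3 (where the u prefix is optional and str already holds code points). The variable names are only illustrative.

```python
# Python 3: a str is a sequence of Unicode code points, not bytes.
text = "\u0e1a\u0e19"

# Format each code point in the usual U+XXXX notation.
code_points = ["U+%04X" % ord(ch) for ch in text]

print(len(text))       # 2 (two code points, regardless of any encoding)
print(code_points)     # ['U+0E1A', 'U+0E19']
print(text == "บน")    # True
```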
As for explaining the encode and decode calls, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as utf-8 encoded input to get back the Unicode code points those bytes represent (which is u'\u0e1a\u0e19'). Calling encode on that string of Unicode code points turns it back into a sequence of bytes (a str in Python 2, a bytes object in Python 3), which gives you the original series of bytes '\xe0\xb8\x9a\xe0\xb8\x99' again.
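The round trip described above could look roughly like this in Python 3. Note that bytes literals need the b prefix there; the names raw, text and round_trip are just placeholders for this sketch.

```python
# Bytes as they might arrive from a socket or file (Python 3 bytes literal).
raw = b"\xe0\xb8\x9a\xe0\xb8\x99"

# decode: bytes -> Unicode code points (str in Python 3).
text = raw.decode("utf-8")

# encode: Unicode code points -> bytes again.
round_trip = text.encode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text == "\u0e1a\u0e19")                   # True
print(round_trip == raw)                        # True
```

In Python 2 the same calls exist, but raw would be a str and text a unicode object.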
Of course, you can encode those Unicode code points into other encodings such as UTF-16, which on little-endian platforms results in the bytes '\xff\xfe\x1a\x0e\x19\x0e' (the leading '\xff\xfe' is the byte order mark), or encode those code points into a non-Unicode encoding. As this looks like Thai we can use the iso8859-11 encoding, which produces the bytes '\xba\xb9' - but this is not portable, as it will only be shown as Thai on systems configured for this particular encoding. This is one of the reasons why Unicode was invented: the same bytes '\xba\xb9' could be decoded using iso8859-1, which renders as º¹, or using iso8859-11, which renders as บน.
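A short Python 3 sketch of the same comparison follows. It assumes the explicit "utf-16-le" codec plus a manually prepended byte order mark, so that the result does not depend on the byte order of the machine running it; the plain "utf-16" codec would use the platform's native order instead.

```python
import codecs

text = "\u0e1a\u0e19"

# UTF-16, little endian, with the byte order mark written explicitly.
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# A legacy single-byte Thai encoding: one byte per character.
thai = text.encode("iso8859-11")

# The same two bytes mean different things under different encodings.
as_latin1 = thai.decode("iso8859-1")
as_thai = thai.decode("iso8859-11")

print(utf16_le)          # b'\xff\xfe\x1a\x0e\x19\x0e'
print(thai)              # b'\xba\xb9'
print(ascii(as_latin1))  # '\xba\xb9'  (the characters º and ¹)
print(as_thai == text)   # True
```

The bytes alone do not say which interpretation is right; the encoding has to be known from somewhere else.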
In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF-8 encoding of the Unicode code points written as u'\u0e1a\u0e19' in Python syntax. Raw bytes (coming over the wire, read from a file) are generally not in the form of Unicode code points and must be decoded into them. Unicode code points are not an encoding, and when sent across the wire (or written to a file) they must be encoded into some byte representation, which in many cases is utf-8 as it has the greatest portability.
Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)