You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
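You can verify this in Python 3 (a quick sketch; in Python 2 the same check works with a u'\u20ac' literal):

    >>> '\u20ac'.encode('utf-8')
    b'\xe2\x82\xac'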
And that explains your confusion here:
Why did encode return 'â‚¬' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you â‚¬.
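You can replay that mistake deliberately in Python 3 (a sketch; 'cp1252' here stands in for whatever codec your console actually uses):

    data = '\u20ac'.encode('utf-8')   # b'\xe2\x82\xac'
    print(data.decode('cp1252'))      # prints â‚¬: the three UTF-8 bytes misread as CP-1252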
Meanwhile, when you write this:
    p='€'
The console isn't giving Python Unicode; it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 encoding of the Euro sign is the single byte \x80. So, this is the same as typing:
    p='\x80'
But in Latin-1, \x80 isn't the Euro sign; it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80', which is exactly what you're seeing.
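To see the two decodings side by side, here's a minimal Python 3 sketch that uses an explicit bytes literal in place of the console input:

    p = b'\x80'                       # the byte a CP-1252 console sends for the Euro sign
    print(repr(p.decode('cp1252')))   # '€': decoded with the codec the console actually used
    print(repr(p.decode('latin-1')))  # '\x80': U+0080, the invisible control character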
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
    p='€'
… the console gives Python some bytes, which Python then automatically decodes into Unicode using the character set it guessed for the console (CP-1252). So, it's equivalent to writing this:
    p='\u20ac'
… or this:
    import sys
    p=b'\x80'.decode(sys.stdin.encoding)
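All three spellings produce the same one-character Unicode string. A quick sanity check (hard-coding 'cp1252' on the assumption that that really is your console's encoding):

    assert '€' == '\u20ac' == b'\x80'.decode('cp1252')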
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A unicode string in Python is a sequence of code points. A Python 2 str is a sequence of bytes, not code points. Hex is just a way of representing a number: the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
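A couple of lines in Python 3 make the point (ord gives you the code point of a one-character string as a plain integer):

    assert 0x20ac == 8364 and 0x80 == 128   # same numbers, different notation
    print(ord('\u20ac'))    # 8364, the Euro sign's code point in decimal
    print(hex(ord('€')))    # '0x20ac', the same code point written in hex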
A sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have one. And, depending on the encoding, some code points may not be representable at all, while others may take 2 or more bytes to represent.
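For example, here's the Euro sign under three encodings (Python 3; Latin-1 simply has no byte for it):

    print('\u20ac'.encode('utf-8'))    # b'\xe2\x82\xac' (three bytes)
    print('\u20ac'.encode('cp1252'))   # b'\x80' (one byte)
    try:
        '\u20ac'.encode('latin-1')     # no Euro sign in Latin-1...
    except UnicodeEncodeError as e:
        print(e)                       # ...so this raises UnicodeEncodeError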
Finally:
Also...it seems that Latin1 hex code points CAN be different then their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII, and Unicode's first 256 code points match Latin-1. But UTF-8 only encodes the ASCII range, U+00 through U+7F, as single bytes with the same value as the code point; everything from U+80 through U+FF takes two bytes. CP-1252 is a different superset of the printable subset of Latin-1, one that reuses Latin-1's 0x80-0x9F control range for printable characters like the Euro sign. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
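One last sketch to make that concrete (Python 3):

    print('A'.encode('latin-1'), 'A'.encode('utf-8'))   # b'A' b'A': the ASCII range agrees
    print('é'.encode('latin-1'), 'é'.encode('utf-8'))   # b'\xe9' vs b'\xc3\xa9': U+00E9 differs
    print('€'.encode('cp1252'), '€'.encode('utf-8'))    # b'\x80' vs b'\xe2\x82\xac'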