
FOR PYTHON 2.7 (I took a shot at using encode in Python 3 and am all confused now... I would love some advice on how to replicate this test in Python 3.)

For the Euro character (€) I looked up what its utf8 Hex code point was using this tool. It said it was 0x20AC.

For Latin1 (again using Python 2.7), I used decode to get its Hex code point:

>>> import unicodedata
>>> p = '€'
>>> # notably, \x80 seems to correspond to [Windows CP1252 according to the link][2]
>>> p.decode('latin-1')
u'\x80'

Then I used this print statement for both of them, and this is what I got:

for utf8:

>>> print unichr(0x20AC).encode('utf-8')
â‚¬

for latin-1:

>>> print unichr(0x80).encode('latin-1')
€

What in the heck happened? Why did encode return 'â‚¬' for utf-8? Also...it seems that Latin1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes differently -- says that Latin1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me...HOWEVER the reason why Python 2.7 is reading the Windows CP1252 'x80' is a real mystery to me....is this the standard for latin-1 in Python 2.7??

user14696
  • 3
    First, the UTF-8 for the Euro character is `'\xE2\x82\xAC'`. In UTF-8, `'\x20\xAC'` is a space followed by an illegal character. – abarnert Dec 11 '13 at 01:47
  • Are you saying I need to enter '\xe2\x82\xac' ?? that doesn't make sense to me.... or python. I get "SyntaxError: invalid syntax" when I put in: >>> print unichr('\xe2\x82\xac').encode('utf-8') – user14696 Dec 11 '13 at 01:55
  • 1
    First, you didn't put the quotes in. Second, `unichr` takes a number, not a string, so that isn't going to work anyway. To convert a string of bytes into a `unicode`, you need to use the `decode` method—and specify the encoding you want to convert from. So, `print '\xe2\x82\xac'.decode('utf-8').encode('utf-8')` will give you… exactly what you started with, as you'd expect. – abarnert Dec 11 '13 at 01:56
  • Sorry edit. It says: SyntaxError: invalid syntax – user14696 Dec 11 '13 at 01:57
  • Then you copied and pasted wrong. If I paste that exact code into my Python 2.7 interpreter, or an online interpreter like [this one](http://ideone.com/8tckH2), there is no `SyntaxError`; it prints out `€` on a machine with a UTF-8 console, `â‚¬` on a machine with a CP-1252 console, etc. – abarnert Dec 11 '13 at 02:08
  • Are you still baffled on the question at the end? Just in case: CP1252 isn't "the standard for Latin-1 in Python 2.7". Your terminal (aka DOS prompt) is what's responsible for converting your keystrokes to bytes to send to Python. Since you're on a Windows box with CP1252 as the "OEM code page", your terminal encodes everything as CP1252. If you let Python decode it as CP1252, or just treat it as bytes and pass it through untouched, everything is fine. If you try to treat it as Latin-1, you will get it right for many characters, but wrong for a few dozen, including the Euro symbol. – abarnert Dec 11 '13 at 18:48

1 Answer


You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.

First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
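The distinction between the code point and its UTF-8 bytes can be checked directly. This is a minimal sketch in Python 3 (not the asker's Python 2.7); the variable names are only illustrative.

```python
# Python 3: str holds code points, bytes holds encoded data.
euro = "\u20ac"  # the Euro sign, Unicode code point U+20AC

code_point = ord(euro)            # the code point as an integer
utf8_bytes = euro.encode("utf-8") # the UTF-8 encoding of that code point

print(hex(code_point))   # 0x20ac
print(utf8_bytes)        # b'\xe2\x82\xac'
print(len(utf8_bytes))   # 3
```

One code point, three bytes: the number 0x20AC never appears in the UTF-8 byte sequence.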


And that explains your confusion here:

Why did encode return 'â‚¬' for utf-8?

It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you â‚¬.
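What the console does with those bytes can be simulated without a Windows terminal. The sketch below is Python 3 and assumes the console behaves like a plain CP-1252 decoder; it decodes the same three bytes two ways.

```python
# The three UTF-8 bytes for the Euro sign.
utf8_bytes = "\u20ac".encode("utf-8")

# A UTF-8 console reads them back as one character.
as_utf8 = utf8_bytes.decode("utf-8")

# A CP-1252 console reads each byte as its own character.
as_cp1252 = utf8_bytes.decode("cp1252")

print(repr(as_utf8))    # '€'
print(repr(as_cp1252))  # 'â‚¬'
print(len(as_cp1252))   # 3
```

The bytes are identical in both cases; only the decoder applied to them changes.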


Meanwhile, when you write this:

p='€'

The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 for the Euro sign is \x80. So, this is the same as typing:

p='\x80'

But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
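The two readings of the byte 0x80 can be compared side by side. This is an illustrative Python 3 sketch; `unicodedata` is used only to label the resulting characters.

```python
import unicodedata

raw = b"\x80"  # the single byte the CP-1252 console sent for the Euro key

latin1_char = raw.decode("latin-1")  # Latin-1 maps byte N to code point U+00NN
cp1252_char = raw.decode("cp1252")   # CP-1252 maps 0x80 to the Euro sign

print(hex(ord(latin1_char)), unicodedata.category(latin1_char))  # 0x80 Cc
print(hex(ord(cp1252_char)), unicodedata.name(cp1252_char))      # 0x20ac EURO SIGN
```

Category `Cc` is the Unicode class for control characters, which is why the Latin-1 result is invisible when printed.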


The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:

p='€'

… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:

p='\u20ac'

… or this:

p=b'\x80'.decode(sys.stdin.encoding)
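To replicate the Python 2 experiment in Python 3, the console's encoding step has to be done explicitly, because `p` starts out as Unicode rather than bytes. The sketch below hard-codes CP-1252 as the console encoding; that is an assumption about the asker's machine, and `sys.stdin.encoding` may report something else.

```python
# Python 3: the literal is already a Unicode string.
p = "\u20ac"

# Step 1: recreate the bytes a CP-1252 console would have handed to Python 2.
console_bytes = p.encode("cp1252")

# Step 2: repeat the original test by decoding those bytes as Latin-1.
decoded_as_latin1 = console_bytes.decode("latin-1")

print(console_bytes)            # b'\x80'
print(repr(decoded_as_latin1))  # '\x80'
```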

Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.

A code point is a Unicode concept. In Python 2, a unicode string is a sequence of code points, while a str is a sequence of bytes, not code points. Hex is just a way of representing a number: the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
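That hex is only a notation for a number is easy to confirm. A short Python 3 sketch, where `chr` plays the role of Python 2's `unichr`:

```python
# Hex and decimal literals name the same integers.
euro_number = 0x20AC
control_number = 0x80

print(euro_number)         # 8364
print(control_number)      # 128
print(int("20AC", 16))     # 8364

# chr/ord convert between an integer and the code point it names.
euro = chr(euro_number)
print(ord(euro) == 8364)   # True
```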

That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
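The dependence on encoding can be seen by encoding the same code point several ways. The following Python 3 sketch collects the result for each encoding, recording `None` where the Euro sign cannot be represented; the `results` dictionary is only an illustration.

```python
euro = "\u20ac"

results = {}
for encoding in ("ascii", "latin-1", "cp1252", "utf-8"):
    try:
        results[encoding] = euro.encode(encoding)
    except UnicodeEncodeError:
        # This encoding has no byte sequence for U+20AC.
        results[encoding] = None

for encoding, encoded in results.items():
    print(encoding, encoded)
# ascii None
# latin-1 None
# cp1252 b'\x80'
# utf-8 b'\xe2\x82\xac'
```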


Finally:

Also...it seems that Latin1 hex code points CAN be different than their utf8 counterparts (I have a colleague who believes differently -- says that Latin1 is just like ASCII in this respect).

Latin-1 is a superset of ASCII. Unicode is also a superset of the printable subset of Latin-1: every Latin-1 character has the same number as its Unicode code point. That does not carry over to UTF-8 bytes, though. Code points up to U+7F are encoded in UTF-8 as a single byte with the same value as the code point, but code points from U+80 to U+FF take two bytes. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
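The colleague's claim holds for the ASCII range but not beyond it, which a small comparison makes concrete. This Python 3 sketch uses `A` (U+0041) and `é` (U+00E9) as example characters; they are chosen here for illustration.

```python
ascii_char = "A"         # U+0041, inside the ASCII range
latin1_only = "\u00e9"   # U+00E9 (é), in Latin-1 but outside ASCII

# Inside ASCII, all three encodings produce the same single byte.
a_latin1 = ascii_char.encode("latin-1")
a_cp1252 = ascii_char.encode("cp1252")
a_utf8 = ascii_char.encode("utf-8")
print(a_latin1, a_cp1252, a_utf8)  # b'A' b'A' b'A'

# Above U+7F, Latin-1 and CP-1252 still use one byte; UTF-8 uses two.
e_latin1 = latin1_only.encode("latin-1")
e_cp1252 = latin1_only.encode("cp1252")
e_utf8 = latin1_only.encode("utf-8")
print(e_latin1, e_cp1252, e_utf8)  # b'\xe9' b'\xe9' b'\xc3\xa9'
```

So the code point of `é` is the same in Latin-1 and Unicode, while its Latin-1 and UTF-8 byte sequences differ.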

abarnert
  • There's a mistake in your last paragraph: Unicode, not UTF-8, is a superset of the printable parts of Latin-1. – jwodder Dec 11 '13 at 02:06
  • What are 'all the printable parts of Latin-1'? I've read the Python 2 HOWTO a dozen times (i.e. esp "...Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values.."). I'm sorry for using 'hex code points' incorrectly (yes, I am referring to the Unicode code point) and that is what I am confused by -- how can the code point for the Latin-1 Euro character be different than the utf-8 character if utf8 is a 'superset' of Latin-1?? – user14696 Dec 11 '13 at 02:11
  • @jwodder: Thanks; I rewrote it to be clearer, I hope. – abarnert Dec 11 '13 at 02:12
  • 1
    @user14696: See [the Wikipedia article](http://en.wikipedia.org/wiki/ISO/IEC_8859-1). Latin-1 actually doesn't define 0x80-0x9F, so it's not really true that Unicode code point U+0080 is identical to Latin-1 0x80, because there _is_ no Latin-1 0x80. But Unicode defines U+0080-U+009F to be reserved control characters, so that's not a big deal. – abarnert Dec 11 '13 at 02:13
  • 1
    @user14696: Meanwhile, even the original version of the last paragraph explained that there is no Euro sign in Latin-1. You're not using Latin-1, you're using CP-1252, a Latin-1 extension that defines (parts of) 0x80-0x9F for characters that don't exist in Latin-1, and which map all over the place in Unicode. The CP-1252 0x80 is U+20AC, `€`; the Latin-1 0x80 is U+0080, a reserved control character. – abarnert Dec 11 '13 at 02:16
  • @abarnert: Thanks...I think I am starting to see the 'light' here. I scrolled through the Unicode code points you mentioned here (i.e. U+FF) and I definitely think I see what you are talking about. Thanks so much for clarifying! http://www.utf8-chartable.de/unicode-utf8-table.pl – user14696 Dec 11 '13 at 03:11