Python 2.7: Names of unicode representations

Question

What are the names for these different kinds of ascii representations of unicode?

\xF0\x9F\x98\xA2
\U0001f622

And is there a term for the set that they belong to that's more specific than "representation"? And in the context of these, how would I describe the non-ascii representation (😢)?

Since I don't know what to call them it is very hard to search for how to work with them.

Thanks!

This seems like a language-specific question or at least would have language-specific answers. Also, where the hex byte format is allowed, it's not a given that the bytes are interpreted as Unicode. For example, not allowed in C#. In JavaScript, it represents bytes from ISO 8859-1 that are then put into the string as Unicode characters. — Tom Blodget, Oct 10 '17 at 21:18

Mantas Kandratavičius · Accepted Answer · 2017-10-10T22:34:46.920


As Tom Blodget already warned you, this is a somewhat Python-specific answer.

The leading \ shows that it's an escape sequence.

\x means that the next two characters will be interpreted as two hex digits (a single 8-bit value).

\U means that the next eight characters will be interpreted as a 32-bit hex value.

You can read more about that here:

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

To fully answer your question:

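\xF0\x9F\x98\xA2 is four bytes written as hex escapes. They are not ASCII (ASCII only covers \x00 to \x7f); together they are the UTF-8 encoding of the codepoint in the next item
\U0001f622 is a Unicode codepoint written as an eight-digit (32-bit) hex escape
😢 is a glyph or simply a special character.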
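
To make the difference concrete, here is a minimal sketch, assuming Python 3 (results on Python 2.7 can differ depending on the build). The names `utf8_bytes` and `text` are only illustrative.

```python
import unicodedata

utf8_bytes = b"\xF0\x9F\x98\xA2"   # byte string: four hex byte escapes
text = "\U0001f622"                # text string: one codepoint escape

print(len(utf8_bytes))                     # number of bytes
print(len(text))                           # number of codepoints
print(utf8_bytes.decode("utf-8") == text)  # the bytes are the UTF-8 form of the codepoint
print(unicodedata.name(text))              # official Unicode name of the codepoint
```

So useful search terms are "escape sequence", "UTF-8 bytes" and "Unicode codepoint".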

edited Oct 10 '17 at 22:34

answered Oct 10 '17 at 21:42


The first is also the UTF-8 encoding of the second. – Robᵩ Oct 10 '17 at 21:47

Hex values outside the range `\x00` to `\x7f` aren't ASCII. That's not a 16-bit hex value, it's 32 bits. And the technical term is "codepoint", not "character". – Mark Ransom Oct 10 '17 at 22:20
Yup, I confused hex values with hex digits. 8 Hex digits means 32-bits, you're right, I edited the answer. – Mantas Kandratavičius Oct 10 '17 at 22:24
@MarkRansom Do you mean for the second bullet point in this answer it should read _"`\U0001f622` is a UNICODE codepoint"_ ? – Nathan Hinchey Oct 10 '17 at 22:27
@NathanHinchey exactly. – Mark Ransom Oct 10 '17 at 22:27

Tom Blodget · Answer 2 · 2017-10-11T16:32:50.337

For Python 3

First there seems to be a misunderstanding about the hex escapes:

print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")

and for completeness (I recall using octal effectively in machine code where some instructions had 3-bit, aligned arguments but I don't see the point in real programming):

print("\xF0\x9F\x98\xA2" == "\360\237\230\242")

It appears they are all Unicode codepoint escapes in 2-digit hexadecimal, 4-digit hexadecimal, and 8-digit hexadecimal, with ranges from U+0000 to U+00FF, U+FFFF, and U+10FFFF, respectively.
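
A quick way to check this, sketched for Python 3 (the variable names are just for illustration), is to look at the codepoints each escape produces:

```python
two_digit = "\xF0"           # \x takes exactly 2 hex digits
four_digit = "\u00F0"        # \u takes exactly 4 hex digits
eight_digit = "\U000000F0"   # \U takes exactly 8 hex digits

# All three spell the same single codepoint, U+00F0.
print(two_digit == four_digit == eight_digit)

# In a str literal, each \x escape is its own codepoint, not a UTF-8 byte.
code_points = [hex(ord(ch)) for ch in "\xF0\x9F\x98\xA2"]
print(code_points)

# Only \U can reach codepoints above U+FFFF in a single escape.
print(hex(ord("\U0001f622")))
```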

We can confirm that, unlike other languages where the \u escape is for a UTF-16 code unit, in Python 3, it is really a codepoint.

print("\ud83d\ude22" == "\U0000d83d\U0000de22")

and for completeness:

print("\U0001f622" == "😢")
print("\N{CRYING FACE}" == "😢")

In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".

Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, not characters. They reserve the codepoint codespace for the UTF-16 code units with corresponding values. This is the way the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode was expanded from 2^16 codepoints to the current range of U+0000 to U+10FFFF (which needs 21 bits). For more information see the Unicode FAQ.
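
The surrogate values can be derived by hand. The following is a rough sketch of the UTF-16 arithmetic; `to_surrogate_pair` is a hypothetical helper written for this answer, not a standard library function, and the `surrogatepass` step assumes Python 3.4 or later.

```python
def to_surrogate_pair(code_point):
    """Split a codepoint above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F622)
print(hex(high), hex(low))

# The UTF-16 codec produces the same two code units (big-endian, no BOM).
utf16_be = "\U0001f622".encode("utf-16-be")
print(utf16_be == b"\xd8\x3d\xde\x22")

# Two lone surrogates in a Python 3 str can be merged into the real
# codepoint by round-tripping through UTF-16 with "surrogatepass".
joined = "\ud83d\ude22".encode("utf-16", "surrogatepass").decode("utf-16")
print(joined == "\U0001f622")
```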

As @Robᵩ points out, you can have a bytestring literal, too:

print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")
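
Going the other way shows how the two spellings in the question relate. This is a small sketch for Python 3, with illustrative variable names: the same four bytes give one codepoint when decoded as UTF-8, and four separate codepoints when decoded as Latin-1.

```python
utf8_bytes = b"\xF0\x9F\x98\xA2"

as_utf8 = utf8_bytes.decode("utf-8")       # one codepoint, U+1F622
as_latin1 = utf8_bytes.decode("latin-1")   # four codepoints, U+00F0 U+009F U+0098 U+00A2

print(as_utf8 == "\U0001f622")
print(as_latin1 == "\xF0\x9F\x98\xA2")

# If text was wrongly decoded as Latin-1, the original bytes can be
# recovered and decoded again as UTF-8.
repaired = as_latin1.encode("latin-1").decode("utf-8")
print(repaired == "\U0001f622")
```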

The treatment of `\u` may depend on the version and build of Python you're using. For me, `u"\ud83d\ude22" == u'\U0001f622'` yields `True` on 2.7 but `False` on 3.6. — Mark Ransom, Oct 11 '17 at 15:43
