
What are the names for these different kinds of ascii representations of unicode?

  • \xF0\x9F\x98\xA2
  • \U0001f622

And is there a term for the set that they belong to that's more specific than "representation"? And in the context of these, how would I describe the non-ASCII representation (😢)?

Since I don't know what to call them it is very hard to search for how to work with them.

Thanks!

Nathan Hinchey
    This seems like a language-specific question or at least would have language-specific answers. Also, where the hex byte format is allowed, it's not a given that the bytes are interpreted as Unicode. For example, not allowed in C#. In JavaScript, it represents bytes from ISO 8859-1 that are then put into the string as Unicode characters. – Tom Blodget Oct 10 '17 at 21:18

2 Answers


As Tom Blodget already warned you, this is a somewhat Python-specific answer.


The leading \ shows that it's an escape sequence.

\x means that the next two characters will be interpreted as hex digits, giving a single value from 0x00 to 0xFF.

\U means that the next eight characters will be interpreted as a 32-bit hex value.

You can read more about that here:

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
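As a minimal sketch of the rules above (assuming Python 3; the variable names are only illustrative), the same \x escapes mean different things in a str literal and in a bytes literal:

```python
# In a str literal, each \xNN escape yields one code point in U+0000..U+00FF.
text = "\xF0\x9F\x98\xA2"
# In a bytes literal, each \xNN escape yields one byte.
raw = b"\xF0\x9F\x98\xA2"
# \U takes exactly eight hex digits and yields a single code point.
emoji = "\U0001f622"

text_code_points = [hex(ord(ch)) for ch in text]    # four code points
raw_values = list(raw)                               # four byte values
emoji_code_points = [hex(ord(ch)) for ch in emoji]   # one code point

print(len(text), text_code_points)
print(len(raw), raw_values)
print(len(emoji), emoji_code_points)
```

So the str holds four Latin-1 range code points, the bytes object holds four bytes, and only the \U form holds the single code point U+1F622.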

To fully answer your question:

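  • \xF0\x9F\x98\xA2 is four hex-escaped values. In a bytes literal (b"...") they are the four bytes of the UTF-8 encoding of U+1F622; in a str literal they are four separate code points, U+00F0, U+009F, U+0098 and U+00A2. None of them is ASCII, since ASCII stops at \x7F.
  • \U0001f622 is a Unicode code point written as an 8-digit (32-bit) hex escape
  • 😢 is the literal character (what you see rendered is its glyph), or simply a special character.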
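To tie the three forms together, here is a short sketch, assuming Python 3 and that the escaped bytes are UTF-8 (the variable names are my own). It only uses the standard library's unicodedata module and the built-in codecs:

```python
import unicodedata

raw = b"\xF0\x9F\x98\xA2"   # escaped bytes, assumed to be UTF-8
escaped = "\U0001f622"      # escaped code point in a str literal

decoded = raw.decode("utf-8")                    # bytes -> the actual character
name = unicodedata.name(escaped)                 # official Unicode character name
code_point = f"U+{ord(escaped):04X}"             # conventional U+XXXX notation
as_escape = escaped.encode("unicode_escape")     # character -> ASCII-only escape

print(decoded == escaped, name, code_point, as_escape)
```

The search terms that tend to help here are "escape sequence", "code point", and "UTF-8 encoded bytes".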

For Python 3

First there seems to be a misunderstanding about the hex escapes:

print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")

and for completeness (I recall using octal effectively in machine code where some instructions had 3-bit, aligned arguments but I don't see the point in real programming):

print("\xF0\x9F\x98\xA2" == "\360\237\230\242")

It appears they are all Unicode codepoint escapes in 2-digit hexadecimal, 4-digit hexadecimal, and 8-digit hexadecimal, with ranges from U+0000 to U+00FF, U+FFFF, and U+10FFFF, respectively.

We can confirm that, unlike other languages where \u denotes a UTF-16 code unit, in Python 3 it really is a codepoint.

print("\ud83d\ude22" == "\U0000d83d\U0000de22")

and for completeness:

print("\U0001f622" == "😢")
print("\N{CRYING FACE}" == "😢")

In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".

Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, they are not characters. They reserve part of the codespace for the UTF-16 code units with the corresponding values. This is how the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode was expanded from 2^16 codepoints to the current range of U+0000 to U+10FFFF (a little over 2^20 codepoints). For more information see the Unicode FAQ.
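For illustration, here is a rough sketch of how a UTF-16 surrogate pair maps back to a single codepoint. The helper name combine_surrogates is made up for this example; the arithmetic is the standard UTF-16 decoding rule, and the second part relies on Python 3's "surrogatepass" error handler:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one codepoint (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD83D, 0xDE22)
print(hex(code_point))

# The same thing using codecs: write the two lone surrogates out as
# UTF-16 code units, then decode them as a proper UTF-16 pair.
pair = "\ud83d\ude22"
recombined = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(recombined == "\U0001f622")
```

This can be handy when text arrives from a source that escaped each UTF-16 code unit separately.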


As @Robᵩ points out, you can have a bytestring literal, too:

print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")
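If you end up with the four \x escapes inside a str rather than a bytes object (for example because UTF-8 bytes were decoded as Latin-1 somewhere upstream, which is an assumption about your situation), one possible repair is to map each codepoint back to a byte and decode again. A minimal sketch:

```python
# A str whose four code points are really the UTF-8 bytes of one character.
mis_decoded = "\xF0\x9F\x98\xA2"

# Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# so this recovers the original bytes, which are then decoded as UTF-8.
repaired = mis_decoded.encode("latin-1").decode("utf-8")

print(repaired == "\U0001f622")
```

This only works if every character in the string is below U+0100 and the recovered bytes are valid UTF-8; otherwise encode or decode will raise an exception.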
Tom Blodget
    The treatment of `\u` may depend on the version and build of Python you're using. For me, `u"\ud83d\ude22" == u'\U0001f622'` yields `True` on 2.7 but `False` on 3.6. – Mark Ransom Oct 11 '17 at 15:43