For Python 3
First, there seems to be a misunderstanding about the hex escapes:
print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")
and for completeness (I recall using octal effectively in machine code, where some instructions had 3-bit aligned arguments, but I don't see the point in real programming):
print("\xF0\x9F\x98\xA2" == "\360\237\230\242")
They are all Unicode codepoint escapes, differing only in how many hex digits they take: \x takes 2 (U+0000 to U+00FF), \u takes 4 (up to U+FFFF), and \U takes 8 (up to U+10FFFF).
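A quick way to see the widths and limits (these asserts are my own illustration, not from the question):
assert "\xFF" == "\u00FF" == "\U000000FF"  # one codepoint, three spellings
assert chr(0x10FFFF) == "\U0010FFFF"       # the largest valid codepoint
# "\U00110000" is a SyntaxError: the escape cannot name a codepoint past U+10FFFF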
We can confirm that, unlike in other languages where \u denotes a UTF-16 code unit, in Python 3 it really is a codepoint.
print("\ud83d\ude22" == "\U0000d83d\U0000de22")
and for completeness:
print("\U0001f622" == "")
print("\N{CRYING FACE}" == "")
In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".
Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, they are not characters; they reserve part of the codespace for the UTF-16 code units with the corresponding values. This is how the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode grew from 2^16 codepoints to the current 1,114,112 (U+0000 through U+10FFFF). For more information, see the Unicode FAQ.
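To make that extension concrete, here is the standard surrogate-pair arithmetic (a sketch with variable names of my own choosing):
codepoint = 0x1F622                  # U+1F622 CRYING FACE
offset = codepoint - 0x10000         # the 20 bits beyond the BMP
high = 0xD800 + (offset >> 10)       # lead (high) surrogate
low = 0xDC00 + (offset & 0x3FF)      # trail (low) surrogate
print(hex(high), hex(low))           # 0xd83d 0xde22
print("\ud83d\ude22" == chr(high) + chr(low))  # True
# Because these are only reserved codepoints, not characters,
# "\ud83d\ude22".encode("utf-8") raises UnicodeEncodeError.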
As @Robᵩ points out, you can have a bytestring literal, too:
print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")