The answer to "How to work with surrogate pairs in Python?" shows how to convert a surrogate pair, such as `'\ud83d\ude4f'`, into a single non-BMP Unicode character: `"\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')`. I would like to know how to do the reverse. Using Python, how can I find the equivalent surrogate pair for a non-BMP character, converting `'\U0001f64f'` (🙏) back to `'\ud83d\ude4f'`? I couldn't find a clear answer to that.
12

Reblochon Masque

hilssu
-
Do you absolutely need the (technically invalid) `'\ud83d\ude4f'` string, or would the UTF-16 encoding do? – Martijn Pieters Oct 24 '16 at 16:14
-
I'm not sure, but I think so. Typing print('\U0001f64f') on the IDLE shell will raise an error message "Non-BMP character not supported in Tk", but typing print('\ud83d\ude4f') (on IDLE) will in fact print the non-BMP emoji character to the IDLE shell, which is supposed to be impossible. – hilssu Oct 24 '16 at 16:19
-
Printing non-BMP characters to the IDLE screen is supposedly impossible, but with surrogate pairs at least some of them are printable. That's why I need the "technically invalid" string '\ud83d\ude4f'. If you know another way to print the character to IDLE (using UTF-16 encoding perhaps), that's fine, but finding the surrogate pair will do. – hilssu Oct 24 '16 at 16:28
-
Note that you normally *don't* want raw surrogate characters in a *normal* Python string. Python sometimes uses them for other purposes (see [PEP 0383](https://www.python.org/dev/peps/pep-0383/), and try running `hex(ord(b"\x90".decode('u8', "surrogateescape")))`, which gives `0xdc90`). Instead, use the UTF-16 encoded `bytes` object, or just a list of int UTF-16 code units. – user202729 Oct 12 '21 at 06:11
-
In fact, in newer Python versions this is no longer really needed, as IDLE now partly supports non-BMP characters. The support is not perfect: editing lines that contain non-BMP characters results in weird behavior, but at least they can be printed and pasted without errors or crashes. I'm currently using Python 3.9.1 on Windows 10 (where emojis can be pasted and printed without any need for surrogate pairs), but anyone using, say, Python 3.6 may still find this page useful. – hilssu Nov 08 '21 at 18:50
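The suggestion in the comments to keep a list of integer UTF-16 code units, rather than a string holding raw surrogates, could be sketched as follows. This is only an illustration; `utf16_code_units` is a hypothetical helper name, not something from the thread.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    # Encode without a BOM; little-endian is an arbitrary choice here.
    data = text.encode('utf-16-le')
    # Each code unit occupies exactly two bytes.
    return [
        int.from_bytes(data[i:i + 2], 'little')
        for i in range(0, len(data), 2)
    ]


units = utf16_code_units('\U0001f64f')
print([hex(u) for u in units])
```

The list form avoids putting lone surrogates into a `str`, which is the concern raised about PEP 383.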
2 Answers
5
You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:
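import re

# Matches any single code point outside the Basic Multilingual Plane.
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # A non-BMP character encodes to exactly two 16-bit code units (4 bytes).
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    # Replace every non-BMP character with its surrogate pair.
    return _nonbmp.sub(_surrogatepair, text)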
Demo:
>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
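For comparison, the same pair can be computed arithmetically from the code point, following the UTF-16 definition, instead of going through the codec. This is a sketch and not part of the original answer; `surrogate_pair` is a hypothetical name.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for a single non-BMP character."""
    code_point = ord(char)
    if code_point < 0x10000:
        raise ValueError('character is inside the BMP; no surrogate pair needed')
    # Subtract 0x10000 to get a 20-bit value, then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return chr(high) + chr(low)


pair = surrogate_pair('\U0001f64f')
print(ascii(pair))
```

Converting the result back with `pair.encode('utf-16', 'surrogatepass').decode('utf-16')` should return the original character.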

Martijn Pieters
-
If you already know you have a code point outside of the BMP, then of course the regex part is not necessary. Just `x = char.encode('utf-16-le'); return [chr(int.from_bytes(y, 'little')) for y in (x[0:2], x[2:4])]` – tripleee May 13 '19 at 08:11
3
It's a little complex, but here's a one-liner to convert a single character:
>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'
To convert a mix of characters requires surrounding that expression with another:
>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'

Mark Ransom
-
I stayed away from `str.join()` for just two values; I found using two `chr()` calls to be more readable; I didn't test this on speed however. Using your one-liner to process each character one by one in a `for` loop is going to be very slow compared to a `re.sub()` approach (which can scan text in a C loop). – Martijn Pieters Nov 01 '16 at 18:11
-
Remark: `struct.unpack` this way makes it work for exactly one emoji character. For a string it's possible to use `x=array.array("H"); x.frombytes(
);` – user202729 Oct 12 '21 at 06:19
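The comment above is cut off, so the following is only one possible reading of the `array.array("H")` idea, not the commenter's exact code. It assumes that the `'H'` type code is 2 bytes wide on the platform (the sketch checks this) and that the input contains no lone surrogates already; `with_surrogates_array` is a hypothetical name.

```python
import array
import sys


def with_surrogates_array(text):
    """Replace every non-BMP character in text with its surrogate pair."""
    units = array.array('H')
    # Assumption: 'H' is exactly 2 bytes here, matching one UTF-16 code unit.
    if units.itemsize != 2:
        raise RuntimeError("array type code 'H' is not 2 bytes on this platform")
    # array.array uses native byte order, so encode in the matching order.
    codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
    units.frombytes(text.encode(codec))
    # BMP characters map to themselves; non-BMP ones become two surrogates.
    return ''.join(map(chr, units))


result = with_surrogates_array('Here is a non-BMP character: \U0001f64f')
print(ascii(result))
```

Unlike the `struct.unpack('>2H', ...)` one-liner, this handles a whole string in one pass without fixing the number of code units in advance.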