Python: Find equivalent surrogate pair from non-BMP unicode char

Question

The answer to "How to work with surrogate pairs in Python?" explains how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that.

Do you absolutely need the (technically invalid) `'\ud83d\ude4f'` string, or would the UTF-16 encoding do? — Martijn Pieters, Oct 24 '16 at 16:14
I'm not sure, but I think so. Typing print('\U0001f64f') on the IDLE shell will raise an error message "Non-BMP character not supported in Tk", but typing print('\ud83d\ude4f') (on IDLE) will in fact print the non-BMP emoji character to the IDLE shell, which is supposed to be impossible. — hilssu, Oct 24 '16 at 16:19
Printing non-BMP characters onto the IDLE screen is supposedly impossible, but using surrogate pairs at least some of them are printable. That's why I need the "technically invalid" string '\ud83d\ude4f'. If you know another way to print the character to IDLE (using UTF-16 encoding perhaps), that's fine, but finding the surrogate pair will do. — hilssu, Oct 24 '16 at 16:28
Note that you normally *don't* want to have raw surrogate characters in a *normal* Python string. Python sometimes uses them for other purposes (see [PEP 0383](https://www.python.org/dev/peps/pep-0383/), and try running `hex(ord(b"\x90".decode('u8', "surrogateescape")))` (→ 0xDC90)). Instead, use the UTF-16 encoded `bytes` object, or just a list of int UTF-16 code units. — user202729, Oct 12 '21 at 06:11
In fact, in new Python versions this is no longer really needed as IDLE now somewhat supports non-BMP characters. Not perfectly, editing lines with non-BMP characters results in weird behavior, but at least they can be printed and pasted without errors or crashing. I'm currently using Python 3.9.1 on Windows 10 (and emojis can be pasted and printed without any need for surrogate pairs), but anyone using, say, Python 3.6, may still find this page useful. — hilssu, Nov 08 '21 at 18:50

Martijn Pieters · Accepted Answer · 2016-10-24T16:34:02.743

5

You'll have to manually replace each non-BMP point with the surrogate pair. You could do this with a regular expression:

import re

_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    encoded = char.encode('utf-16-le')
    return (
        chr(int.from_bytes(encoded[:2], 'little')) + 
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    return _nonbmp.sub(_surrogatepair, text)

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
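
For comparison, the same pair can also be derived arithmetically from the code point, without going through an encoder. This is only a sketch, assuming the standard UTF-16 surrogate formula; `surrogate_pair` is a hypothetical helper name and is not part of the answer above:

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(char)
    if cp <= 0xFFFF:
        raise ValueError('character is in the BMP; it has no surrogate pair')
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return chr(high) + chr(low)
```

Such a helper could presumably be used as the replacement callback body in place of the encode/`int.from_bytes` step; whether it is any faster was not measured.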

edited Oct 24 '16 at 16:34

answered Oct 24 '16 at 16:28

Martijn Pieters

If you already know you have a code point outside of the BMP, then of course the regex part is not necessary. Just `x = char.encode('utf-16-le'); return [chr(int.from_bytes(y, 'little')) for y in (x[0:2], x[2:4])]` – tripleee May 13 '19 at 08:11

score 3 · Answer 2 · answered Oct 24 '16 at 17:23

It's a little complex, but here's a one-liner to convert a single character:

>>> import struct
>>> emoji = '\U0001f64f'
>>> ''.join(chr(x) for x in struct.unpack('>2H', emoji.encode('utf-16be')))
'\ud83d\ude4f'

Converting a string that mixes BMP and non-BMP characters requires wrapping that expression in another:

>>> emoji_str = 'Here is a non-BMP character: \U0001f64f'
>>> ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in emoji_str)
'Here is a non-BMP character: \ud83d\ude4f'
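
To check that the converted string really holds the pair for the original character, it can be passed back through the `surrogatepass` round trip quoted in the question. A minimal sketch; `to_surrogates` and `from_surrogates` are hypothetical names that just wrap the expressions shown on this page:

```python
import struct

def to_surrogates(text):
    # Replace each non-BMP character with its two UTF-16 code units.
    return ''.join(
        c if c <= '\uffff'
        else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16-be')))
        for c in text)

def from_surrogates(text):
    # Reverse direction, as given in the question.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

original = 'Here is a non-BMP character: \U0001f64f'
converted = to_surrogates(original)
assert len(converted) == len(original) + 1   # one character became two
assert from_surrogates(converted) == original
```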

answered Oct 24 '16 at 17:23

Mark Ransom

I stayed away from `str.join()` for just two values; I found using two `chr()` calls to be more readable; I didn't test this on speed however. Using your one-liner to process each character one by one in a `for` loop is going to be very slow compared to a `re.sub()` approach (which can scan text in a C loop). – Martijn Pieters Nov 01 '16 at 18:11
Remark: used this way, `struct.unpack` works for exactly one emoji character. For a whole string it's possible to use `x=array.array("H"); x.frombytes( );` – user202729 Oct 12 '21 at 06:19
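
One possible reading of the `array.array("H")` suggestion in the last comment, as a sketch only: the argument to `frombytes` is not given in the comment, so passing the UTF-16 encoded text in native byte order is an assumption here, and `with_surrogates_array` is a hypothetical name.

```python
import array
import sys

def with_surrogates_array(text):
    # array('H') reads values in native byte order, so encode to match it.
    codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
    units = array.array('H')
    assert units.itemsize == 2     # 'H' is assumed to be 16 bits wide here
    units.frombytes(text.encode(codec))
    # Each 16-bit code unit becomes one character; non-BMP characters
    # therefore come out as two surrogate characters.
    return ''.join(map(chr, units))
```

This handles a whole string in one pass, with BMP characters passing through unchanged.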
