How to convert some character into five digit unicode one in Python 3.3?

Question

I'd like to convert some character into a five-digit Unicode one in Python 3.3. For example,

import re
print(re.sub('a', u'\u1D15D', 'abc' ))

but the result is different from what I expected. Do I have to use the character itself rather than its codepoint? Is there a better way to handle five-digit Unicode characters?

The `u''` literal in Python 3 is a no-op; just use `''` instead, that's already unicode. — Martijn Pieters, Jan 31 '13 at 11:50

Martijn Pieters · Accepted Answer · 2013-01-31T12:46:05.160

Python Unicode escapes take either 4 hex digits (\uabcd) or 8 (\Uabcdabcd). For a codepoint beyond U+FFFF you need the latter (with a capital U), left-padded with zeros to 8 digits:

>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'

(And yes, the U+1D15D codepoint (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it and may show a placeholder glyph (a box or question mark) instead.)
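If the glyph does not render, the string can still be inspected without relying on a font. The following is a minimal sketch, assuming Python 3.3 or later (where a non-BMP character counts as one character); the variable name `ch` is only illustrative:

```python
import unicodedata

ch = '\U0001D15D'

# One character, regardless of whether the terminal can draw it.
print(len(ch))               # 1
# The integer codepoint behind the character.
print(hex(ord(ch)))          # 0x1d15d
# The official Unicode name, looked up from the character database.
print(unicodedata.name(ch))  # MUSICAL SYMBOL WHOLE NOTE
```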

Because you used a \uabcd escape, you replaced a in abc with two characters: the codepoint U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) and the ASCII character D. Using the 8-digit \U escape works:

>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc' ))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc' ).encode('unicode_escape'))
b'\\U0001d15dbc'

Again, your font may display the U+1D15D codepoint as a placeholder glyph instead.
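To make the difference between the two escapes concrete, this small sketch compares what each literal actually contains. The names `wrong` and `right` are just labels chosen for this illustration:

```python
# \u consumes exactly 4 hex digits, so the trailing "D" is a separate character.
wrong = '\u1D15D'
# \U consumes exactly 8 hex digits, giving the single codepoint U+1D15D.
right = '\U0001D15D'

print(len(wrong), [hex(ord(c)) for c in wrong])  # 2 ['0x1d15', '0x44']
print(len(right), [hex(ord(c)) for c in right])  # 1 ['0x1d15d']
```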

unutbu · Answer 2 · 2013-01-31T12:31:11.260

By the way, you do not need the re module for this. You could use str.translate:

>>> 'abc'.translate({ord('a'):'\U0001D15D'})
'𝅝bc'
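If the codepoint is only known as a number, the escape can be avoided altogether. The sketch below assumes the value arrives as an integer (the name `codepoint` is hypothetical) and uses `chr` with `str.replace`, with `str.maketrans` shown as an equivalent way to build the translation table:

```python
codepoint = 0x1D15D            # assumed input: the codepoint as an integer
replacement = chr(codepoint)   # chr accepts any codepoint up to 0x10FFFF

# A plain substring replacement needs neither re nor a translation table.
result = 'abc'.replace('a', replacement)
print(result.encode('unicode_escape'))  # b'\\U0001d15dbc'

# str.maketrans builds the same kind of mapping used with str.translate.
table = str.maketrans({'a': replacement})
print('abc'.translate(table) == result)  # True
```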

edited Jan 31 '13 at 12:31

answered Jan 31 '13 at 11:54

It was probably just an illustration, a short example to demonstrate the perceived problem. – Martijn Pieters Jan 31 '13 at 11:55
