How can I get python ''.encode('unicode_escape') to return escape codes for ascii?

Question

I am trying to use the encode method of python strings to return the unicode escape codes for characters, like this:

>>> print( 'ф'.encode('unicode_escape').decode('utf8') )
\u0444

This works fine with non-ascii characters, but for ascii characters, it just returns the ascii characters themselves:

>>> print( 'f'.encode('unicode_escape').decode('utf8') )
f

The desired output would be \u0066. This script is for pedagogical purposes.

How can I get the unicode hex codes for ALL characters?

You can't. `unicode_escape` will never escape printable ASCII characters. What are you trying to do here? In other words, what is the *actual goal*? — Martijn Pieters, Feb 06 '17 at 21:22
@MartijnPieters See edits above. The script is for pedagogical purposes, and the output for `f` would be `\u0066`. — reynoldsnlp, Feb 06 '17 at 21:28
So what about non-BMP codepoints? Should the output be Python string-literal compatible, or are you aiming for, say, JSON-compatible output? This matters; Python would use `\Uhhhhhhhh` (8 hex digits), JSON would use a UTF-16 surrogate pair. And if you aim for Python compatibility, why not `\xhh` for Latin-1 bytes? — Martijn Pieters, Feb 06 '17 at 21:31
@MartijnPieters For my purposes, I can assume that everything will be in the BMP. I am aiming for 4-digit codes compatible with python string literals. — reynoldsnlp, Feb 06 '17 at 21:46

wim · Accepted Answer · 2022-05-22T01:47:02.247

5

`ord` can be used for this; there is no need for encoding/decoding at all:

>>> '"\\U{:08x}"'.format(ord('f'))  # ...or \u{:04x} if you prefer
'"\\U00000066"'
>>> eval(_)
'f'
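
To apply the same idea to a whole string rather than a single character, something like the following sketch might do. The helper name `escape_all` is hypothetical, and it assumes (as stated in the question comments) that every character is in the BMP, so the 4-digit `\u` form is enough:

```python
def escape_all(s):
    """Return s with every character written as a \\uhhhh escape.

    Assumption: all characters are in the BMP (code point <= 0xFFFF);
    anything above that would need the 8-digit \\U form instead.
    """
    # ord() gives the code point; {:04x} pads it to four hex digits.
    return ''.join('\\u{:04x}'.format(ord(ch)) for ch in s)


print(escape_all('fф'))  # prints \u0066\u0444
```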

edited May 22 '22 at 01:47

answered Feb 06 '17 at 21:28


Thanks! The combination of `ord` and the `x` specification for hex format seems to work perfectly. – reynoldsnlp Feb 06 '17 at 21:56

score 1 · Answer 2 · answered Feb 06 '17 at 21:54

You'd have to do so manually. If you assume that all your input is within the Unicode BMP, then a straightforward regex will probably be fastest; this replaces every character with its \uhhhh escape:

import re

def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(lambda m: '\\u{:04x}'.format(
        ord(m.group(0))), s)

I've explicitly limited the pattern to the BMP so that any non-BMP code points pass through unchanged instead of being forced into a 4-digit escape that cannot represent them.

Demo:

>>> print(unicode_escaped('foo bar ф'))
\u0066\u006f\u006f\u0020\u0062\u0061\u0072\u0020\u0444
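
If non-BMP input ever does need escaping, one possible variant (a sketch only, not part of the original answer; `escape_any` is a hypothetical name) falls back to the 8-digit `\U` form that Python string literals use for such code points. The round trip through the `unicode_escape` codec is included as a sanity check:

```python
def escape_any(s):
    """Escape every character: \\uhhhh inside the BMP, \\Uhhhhhhhh above it."""
    parts = []
    for ch in s:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append('\\u{:04x}'.format(cp))
        else:
            # Non-BMP code points need the 8-digit form in Python literals.
            parts.append('\\U{:08x}'.format(cp))
    return ''.join(parts)


escaped = escape_any('f\U0001F600')
print(escaped)  # prints \u0066\U0001f600

# The escaped text is pure ASCII, so it can be decoded back to the original.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == 'f\U0001F600')  # prints True
```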
