I am trying to use the encode method of Python strings to return the Unicode escape codes for characters, like this:

>>> print( 'ф'.encode('unicode_escape').decode('utf8') )
\u0444

This works fine with non-ASCII characters, but for ASCII characters it just returns the characters themselves:

>>> print( 'f'.encode('unicode_escape').decode('utf8') )
f

The desired output would be \u0066. This script is for pedagogical purposes.

How can I get the unicode hex codes for ALL characters?

reynoldsnlp
  • What else do you expect `'f'` to return? – Moinuddin Quadri Feb 06 '17 at 21:21
  • You can't. `unicode_escape` will never escape printable ASCII characters. What are you trying to do here? In other words, what is the *actual goal*? – Martijn Pieters Feb 06 '17 at 21:22
  • @MartijnPieters See edits above. The script is for pedagogical purposes, and the output for `f` would be `\u0066`. – reynoldsnlp Feb 06 '17 at 21:28
  • So what about non-BMP codepoints? Should the output be Python string-literal compatible, or are you aiming for, say, JSON-compatible output? This matters; Python would use `\Uhhhhhhhh` (8 hex digits), JSON would use a UTF-16 surrogate pair. And if you aim for Python compatibility, why not `\xhh` for Latin-1 bytes? – Martijn Pieters Feb 06 '17 at 21:31
  • @MartijnPieters For my purposes, I can assume that everything will be in the BMP. I am aiming for 4-digit codes compatible with python string literals. – reynoldsnlp Feb 06 '17 at 21:46

2 Answers

ord can be used for this; there is no need for encoding or decoding at all:

>>> '"\\U{:08x}"'.format(ord('f'))  # ...or \u{:04x} if you prefer
'"\\U00000066"'
>>> eval(_)
'f'
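
To apply the same idea to a whole string with the 4-digit form the question asks for, something like the following might work. This is only a sketch: the name escape_all is hypothetical, and it assumes every character is in the BMP (a code point above U+FFFF would come out with more than four hex digits and would not be a valid \u escape).

```python
def escape_all(text):
    """Return text with each character written as a \\uhhhh escape.

    Assumes all characters are in the BMP (code points up to U+FFFF).
    """
    # ord() gives the code point; {:04x} pads it to four hex digits.
    return ''.join('\\u{:04x}'.format(ord(ch)) for ch in text)


print(escape_all('f'))   # \u0066
print(escape_all('fф'))  # \u0066\u0444
```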
wim

You'd have to do so manually. If you can assume that all your input is within the Unicode BMP, a straightforward regex will probably be fastest. This replaces every BMP character with its \uhhhh escape:

import re

def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(lambda m: '\\u{:04x}'.format(
        ord(m.group(0))), s)

I've explicitly limited the pattern to the BMP so that any non-BMP code points are simply left unescaped instead of being turned into an invalid \u escape.

Demo:

>>> print(unicode_escaped('foo bar ф'))
\u0066\u006f\u006f\u0020\u0062\u0061\u0072\u0020\u0444
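
As a rough sanity check, the escaped text can be decoded back with the built-in unicode_escape codec, and a non-BMP character can be used to confirm that it passes through untouched. The sketch below repeats the function so that it runs on its own; the variable names escaped, restored, and non_bmp are illustrative only.

```python
import re

# Same function as in the answer above, repeated so this block runs on its own.
def unicode_escaped(s, _pattern=re.compile(r'[\x00-\uffff]')):
    return _pattern.sub(lambda m: '\\u{:04x}'.format(ord(m.group(0))), s)

escaped = unicode_escaped('foo bar ф')

# The result is pure ASCII, so it can be decoded back with the
# unicode_escape codec to recover the original text.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == 'foo bar ф')  # True

# A code point outside the BMP is not matched by the pattern,
# so it is passed through unescaped.
non_bmp = 'a\U0001F600'
print(unicode_escaped(non_bmp) == '\\u0061\U0001F600')  # True
```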
Martijn Pieters