8

Based on a urwid example, it appears that u'\N{HYPHEN BULLET}' creates a Unicode character that is a hyphen intended for use as a bullet.

The names for Unicode characters seem to be listed at fileformat.info, and the Unicode HOWTO in the Python documentation covers some aspects of using Unicode in Python, though I found no mention of the \N{} syntax there.

If you pull all these docs together, you get the idea that the constant u"\N{HYPHEN BULLET}" creates a ⁃.

However, this is only a theory pieced together from those sources. I can find no documentation for "\N{}" in the Python docs.

My question is whether my theory of operation is correct and whether it is documented anywhere?

Ray Salemi
  • The names are part of the [Unicode standard](https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt), also see the duplicate. – Martijn Pieters Dec 13 '20 at 11:29

3 Answers

6

Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:

Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)
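As a small illustration of that table entry, the sketch below compares the named escape with the equivalent numeric escape. It assumes Python 3, where every string literal is Unicode and the u prefix is optional; the variable names are only for illustration.

```python
# \N{name} is resolved when the literal is compiled, so the result is an
# ordinary one-character string.
bullet = "\N{HYPHEN BULLET}"

# The same character written with a numeric escape (code point U+2043).
same_bullet = "\u2043"

print(bullet == same_bullet)          # True
print(len(bullet), hex(ord(bullet)))  # 1 0x2043
```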

Stop harming Monica
6

You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.

It is not documented much in the Python docs, but after some searching I found a reference to it on effbot.org

http://effbot.org/librarybook/ucnhash.htm

The ucnhash module

(Implementation, 2.0 only) This module is an implementation module, which provides a name to character code mapping for Unicode string literals. If this module is present, you can use \N{} escapes to map Unicode character names to codes.

In Python 2.1, the functionality of this module was moved to the unicodedata module.

Checking the documentation for unicodedata shows that the module is using the data from the Unicode Character Database.

unicodedata — Unicode Database

This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 9.0.0.

The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt

Each line has the structure HEXVALUE;CHARACTER NAME;etc., so you can use this data to look up characters.

For example:

# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'

# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'サ'
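
The lookup described above can be sketched in a few lines. This is a minimal example under stated assumptions: it uses Python 3, it reads only the first three semicolon-separated fields, and `parse_ucd_line` is a hypothetical helper name, not part of any library.

```python
import unicodedata


def parse_ucd_line(line):
    """Split one UnicodeData.txt line into (character, name, category)."""
    fields = line.split(";")
    code_point = int(fields[0], 16)  # field 0: code point in hexadecimal
    return chr(code_point), fields[1], fields[2]


sample = "FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;"
char, name, category = parse_ucd_line(sample)

print(name, category)
# The name from the data file resolves to the same character that the
# \N{} escape produces.
print(unicodedata.lookup(name) == char)
print(char == "\N{HALFWIDTH KATAKANA LETTER SA}")
```

Note that some entries in the file carry placeholder names such as <control>, which cannot be used in a \N{} escape, so a real parser would need to handle those separately.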
alxwrd
2

The \N{} syntax is documented in the Unicode HOWTO, at least.

The names are documented in the Unicode standard, such as:

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

The unicodedata module can look up a name for a character:

>>> import unicodedata as ud
>>> ud.name('A')
'LATIN CAPITAL LETTER A'
>>> print('\N{LATIN CAPITAL LETTER A}')
A
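
Going the other way, `unicodedata.lookup` resolves a name to a character at run time, which is useful when the name is not known until the program runs. A short sketch, again assuming Python 3:

```python
import unicodedata as ud

# Resolve a name at run time; the literal escape does the same at compile time.
bullet = ud.lookup("HYPHEN BULLET")
print(bullet == "\N{HYPHEN BULLET}")  # True
print(ud.name(bullet))                # HYPHEN BULLET

# An unknown name raises KeyError.
try:
    ud.lookup("NO SUCH CHARACTER NAME")
except KeyError as exc:
    print("lookup failed:", exc)
```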
Mark Tolonen