Details of Unicode Names \N Documented?

Question

Based on a urwid example, it appears that u'\N{HYPHEN BULLET}' creates a Unicode character: a hyphen intended for use as a bullet.

The names for Unicode characters seem to be defined at fileformat.info, and some aspects of using Unicode in Python are covered in the HOWTO documentation, though I found no mention of the \N{} syntax there.

If you pull all these docs together, you get the idea that the constant u"\N{HYPHEN BULLET}" creates a ⁃ character.

However, this is only a theory pieced together from those sources. I can find no documentation for "\N{}" in the Python docs.

Is my theory of operation correct, and is it documented anywhere?

The names are part of the [Unicode standard](https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt), also see the duplicate. — Martijn Pieters, Dec 13 '20 at 11:29

Stop harming Monica · Answer 1 · 2017-11-29T16:17:08.163

6

Not every gory detail can be found in a how-to. The table of escape sequences in the reference manual includes:

Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database (Unicode only)
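As a quick check of that table entry, here is a minimal sketch (assuming Python 3, where every str literal is Unicode) that compares the \N{name} escape from the question with the equivalent \u escape. The code point U+2043 for HYPHEN BULLET comes from the Unicode character database, not from the question itself.

```python
import unicodedata

# The named escape is resolved when the string literal is compiled.
bullet = "\N{HYPHEN BULLET}"

# It yields exactly one character, the same one as the numeric escape.
assert len(bullet) == 1
assert bullet == "\u2043"

# unicodedata.name() maps the character back to its Unicode name.
assert unicodedata.name(bullet) == "HYPHEN BULLET"

print(bullet, hex(ord(bullet)))
```

So the theory in the question holds: the escape is just another way to spell a single code point inside a string literal.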

edited Nov 29 '17 at 16:17

answered Nov 29 '17 at 15:14


alxwrd · Answer 2 · 2022-02-24T13:51:01.753

You are correct that u"\N{CHARACTER NAME}" produces a valid Unicode character in Python.

It is not documented much in the Python docs, but after some searching I found a reference to it on effbot.org:

http://effbot.org/librarybook/ucnhash.htm

The ucnhash module

(Implementation, 2.0 only) This module is an implementation module, which provides a name to character code mapping for Unicode string literals. If this module is present, you can use \N{} escapes to map Unicode character names to codes.

In Python 2.1, the functionality of this module was moved to the unicodedata module.

Checking the documentation for unicodedata shows that the module is using the data from the Unicode Character Database.

unicodedata — Unicode Database

This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 9.0.0.

The full data can be found at: https://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt

Each line of the data has the structure HEXVALUE;CHARACTER NAME;further fields..., so you can use this file to look up characters by name.

For example:

# 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
>>> u"\N{LATIN CAPITAL LETTER A}"
'A'

# FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;
>>> u"\N{HALFWIDTH KATAKANA LETTER SA}"
'ｻ'
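The record layout described above can be checked against unicodedata with a small sketch. It assumes Python 3; parse_record is a hypothetical helper written for this illustration, and the two records are the lines quoted in the comments above.

```python
import unicodedata

def parse_record(line):
    """Split one UnicodeData.txt record into (character, name).

    Hypothetical helper: only the first two semicolon-separated
    fields (hex code point and character name) are used.
    """
    fields = line.split(";")
    return chr(int(fields[0], 16)), fields[1]

records = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "FF7B;HALFWIDTH KATAKANA LETTER SA;Lo;0;L;<narrow> 30B5;;;;N;;;;;",
]

for record in records:
    char, name = parse_record(record)
    # The name in the data file resolves to the same character
    # that the hex value in the first field denotes.
    assert unicodedata.lookup(name) == char
    print(name, "->", char)
```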

Seems the effbot link is dead, the blog is "in pause" and its content is not accessible. — Tshirtman, Feb 24 '22 at 13:26
@Tshirtman - thanks. I've updated the link to point to the wayback machine. — alxwrd, Feb 24 '22 at 13:53

score 2 · Answer 3 · answered Nov 29 '17 at 17:05

The \N{} syntax is documented in the Unicode HOWTO, at least.

The names are documented in the Unicode standard, such as:

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

The unicodedata module can look up a name for a character:

>>> import unicodedata as ud
>>> ud.name('A')
'LATIN CAPITAL LETTER A'
>>> print('\N{LATIN CAPITAL LETTER A}')
A
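The reverse direction is also available at runtime: unicodedata.lookup() takes a name and returns the character, which helps when the name is only known while the program runs, since \N{} needs the name written out in the literal. A minimal sketch, assuming Python 3; char_from_name is a hypothetical wrapper, not part of the module.

```python
import unicodedata

def char_from_name(name, default=None):
    """Return the character with the given Unicode name.

    Hypothetical convenience wrapper: unicodedata.lookup() raises
    KeyError for unknown names, which is turned into a default here.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default

# lookup() and name() are inverses for named characters.
assert char_from_name("LATIN CAPITAL LETTER A") == "A"
assert unicodedata.name(char_from_name("HYPHEN BULLET")) == "HYPHEN BULLET"

# An unknown name falls back to the default instead of raising.
assert char_from_name("NO SUCH CHARACTER NAME") is None
```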
