List of unicode character names

Question

In Python I can print a unicode character by name (e.g. print(u'\N{snowman}')). Is there a way I can get a list of all valid names?

Beware that if they have a different version of Python, the game may backfire on you: see [Martijn Pieters' answer below](http://stackoverflow.com/a/30302840/2564301). — Jongware, May 18 '15 at 12:21

Martijn Pieters · Accepted Answer · 2020-12-13T11:24:28.830

score 23

Every codepoint has a name, so you are effectively asking for the Unicode standard list of codepoint names (as well as the list of name aliases, supported by Python 3.3 and up).

Each Python version supports a specific version of the Unicode standard; the unicodedata.unidata_version attribute tells you which one for a given Python runtime. The above links lead to the latest published Unicode version; replace UCD/latest in the URLs with the value of unicodedata.unidata_version for your Python version.
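As a quick check, the version attribute can be read directly. The sketch below also builds a versioned URL from it. The `https://www.unicode.org/Public/<version>/ucd/` layout and the helper name `ucd_url` are assumptions made for illustration, not something taken from this answer.

```python
import unicodedata

def ucd_url(filename, version=unicodedata.unidata_version):
    # Hypothetical helper; the URL layout is an assumption, not part of unicodedata.
    return f"https://www.unicode.org/Public/{version}/ucd/{filename}"

# Unicode version bundled with this Python runtime
print(unicodedata.unidata_version)
print(ucd_url("UnicodeData.txt"))
```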

Per codepoint, the unicodedata.name() function can tell you the official name, and unicodedata.lookup() gives you the inverse (name to codepoint).
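A minimal sketch of the two functions, using the snowman from the question (Python 3 assumed):

```python
import unicodedata

snowman = unicodedata.lookup("SNOWMAN")   # name -> character
print(snowman, hex(ord(snowman)))
print(unicodedata.name(snowman))          # character -> official name

# The \N{...} escape from the question resolves the same name.
assert snowman == "\N{SNOWMAN}"
```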

edited Dec 13 '20 at 11:24

answered May 18 '15 at 12:08


Are the functions `name` and `lookup` really inverses? Indeed, `name(lookup('space'))` returns `SPACE`. But `lookup('escape')` returns the expected value, while `name(lookup('escape'))` raises `ValueError: no such name`. – Jeyekomon Jul 28 '22 at 09:25

@Jeyekomon not all Unicode codepoints have a name; `escape` is an alias instead. `lookup()` takes names and aliases (and sequences) but `name()` only ever returns the official name. It’s mostly the control codes like escape that don’t have a name. Note that `space` is an alias, names are always uppercase. Wikipedia has a [nice overview of what doesn’t have a name](https://en.wikipedia.org/wiki/Unicode_character_property#Name). – Martijn Pieters Aug 13 '22 at 11:45
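The asymmetry discussed in these comments can be reproduced with a short sketch (assuming Python 3.3 or later, where `lookup()` accepts aliases). The `"<no name>"` default is just an illustrative placeholder.

```python
import unicodedata

esc = unicodedata.lookup("ESCAPE")      # resolved through a name alias
assert esc == "\x1b"

try:
    unicodedata.name(esc)               # control codes have no official name
except ValueError as exc:
    print("name() failed:", exc)

# Passing a default as the second argument avoids the exception.
print(unicodedata.name(esc, "<no name>"))
```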

score 5 · Answer 2 · edited Apr 24 '20 at 11:51

If you want a list of all Unicode character names, consider downloading the Unicode Character Database.

It is included in the base repositories of many Linux distributions (e.g. "unicode-ucd" on RHEL).

The package includes NamesList.txt, which contains an exhaustive list of Unicode character names.

Caution: NamesList.txt can take a while to download (it is larger than 1.5 MB).

Example:

21FE    RIGHTWARDS OPEN-HEADED ARROW
21FF    LEFT RIGHT OPEN-HEADED ARROW
@@  2200    Mathematical Operators  22FF
@@+
@       Miscellaneous mathematical symbols
2200    FOR ALL
    = universal quantifier
2201    COMPLEMENT
    x (latin letter stretched c - 0297)
2202    PARTIAL DIFFERENTIAL
2203    THERE EXISTS
    = existential quantifier
2204    THERE DOES NOT EXIST
    : 2203 0338
2205    EMPTY SET
    = null set
    * used in linguistics to indicate a null morpheme or phonological "zero"
    x (latin capital letter o with stroke - 00D8)
    x (diameter sign - 2300)
    ~ 2205 FE00 zero with long diagonal stroke overlay form
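To pull just the code points and names out of text shaped like the excerpt above, a rough sketch follows. It assumes entry lines start with a hexadecimal code point followed by whitespace and the name, that annotation lines are indented, and that block headers start with `@`. The function name `parse_names` is hypothetical.

```python
import re

# A hex code point, whitespace, then the name. Indented detail lines and
# "@" header lines do not start with a hex digit run, so they never match.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\s+(\S.*)$")

def parse_names(lines):
    """Return {code point: name} for the entry lines in NamesList-style text."""
    names = {}
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match:
            names[int(match.group(1), 16)] = match.group(2).rstrip()
    return names

sample = """\
21FF    LEFT RIGHT OPEN-HEADED ARROW
@@  2200    Mathematical Operators  22FF
2200    FOR ALL
    = universal quantifier
2204    THERE DOES NOT EXIST
    : 2203 0338
"""
names = parse_names(sample.splitlines())
for code, name in names.items():
    print(f"U+{code:04X} {name}")
```

For the real file, the same function could be fed an open file object instead of the sample string.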

score 2 · Answer 3 · answered Sep 15 '17 at 17:32


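Yes, there is a way: go through all existing code points and call unicodedata.name() on each of them, like this:

import unicodedata

names = []
for c in range(0, 0x10FFFF + 1):
    try:
        # name() expects a one-character string, so convert the integer first
        names.append(unicodedata.name(chr(c)))
    except ValueError:
        # raised for code points that have no name
        pass
# Do something with names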

nitely

At least in Python 3, it should be `except ValueError` instead of `except KeyError`. https://docs.python.org/3/library/unicodedata.html#unicodedata.name – Dominique Unruh Jun 02 '22 at 12:25

score 1 · Answer 4 · answered May 18 '15 at 12:07


For a given codepoint, you can use unicodedata.name. To get them all, you can work through all the billions to see which have such names.

Mike Graham


Not billions. The standard isn't **that** big. Yet. Unicode 7.0 contains 112,804. – Martijn Pieters May 18 '15 at 12:08

There aren't billions of names, but there are billions of potential codepoints to work through and check if we march through naively. – Mike Graham May 18 '15 at 12:11

There are (and forever will be) exactly 1,114,112 potential code points. You'd have to be extremely naïve to walk the entire 32-bit space. – 一二三 May 18 '15 at 13:15
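The figure of 1,114,112 is 17 planes of 65,536 code points each, which is small enough to scan in full. A sketch that checks the arithmetic and counts how many code points have a name in the running interpreter (the count depends on the bundled Unicode version, so no specific number is claimed here):

```python
import sys
import unicodedata

# 17 planes of 65,536 code points each
total = 17 * 0x10000
assert total == 0x110000 == sys.maxunicode + 1
print(f"{total:,} code points")

# name() returns the default (None) for code points without a name
named = sum(1 for cp in range(total) if unicodedata.name(chr(cp), None))
print(f"{named:,} have a name in Unicode {unicodedata.unidata_version}")
```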

score 1 · Answer 5 · answered Nov 12 '19 at 13:29


Just print them all:

import unicodedata 

for i in range(0x110000): 
    character = chr(i) 
    name = unicodedata.name(character, "") 
    if len(name) > 0: 
        print(f"{i:6} | 0x{i:04X} | {character} | {name}")

Stan

score 0 · Answer 6 · answered Jan 25 '19 at 11:48

If you want to insert a Unicode character by name but don't know the name, here is how to get an easy overview of Unicode character names.

On Windows

Open "Character Map" (search for charmap.exe and run it).
Select any common Microsoft font (these tend to have a wide variety of unicode characters defined).
Click on any character on the map to get its Unicode Character Name.

On Mac it's called "Character Palette" and found under System Preferences, "International -> Input" or "Language & Text -> Input Sources" by ticking the box next to "Character Palette".

score 0 · Answer 7 · answered Jan 02 '20 at 15:38


my one liner, just for my own reference ;p

import unicodedata
names = [unicodedata.name(chr(c)) for c in range(0, 0x10FFFF+1) if unicodedata.name(chr(c), None)]
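Building on the same scan, here is a sketch of searching names by substring, which helps when only part of a name is known. The function name `find_names` is hypothetical, and it rescans every code point on each call for simplicity.

```python
import unicodedata

def find_names(fragment):
    """Return (character, name) pairs whose name contains the fragment."""
    fragment = fragment.upper()  # names returned by name() are uppercase
    matches = []
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name and fragment in name:
            matches.append((chr(cp), name))
    return matches

for char, name in find_names("snowman"):
    print(f"U+{ord(char):04X} {name}")
```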

pna
