12

In Python I can print a Unicode character by name (e.g. print(u'\N{snowman}')). Is there a way I can get a list of all valid names?

Miki Tebeka

7 Answers

23

Almost every assigned codepoint has a name (control codes are the main exception), so you are effectively asking for the Unicode standard list of codepoint names (as well as the list of name aliases, supported by Python 3.3 and up).

Each Python version supports a specific version of the Unicode standard; the unicodedata.unidata_version attribute tells you which one for a given Python runtime. The above links lead to the latest published Unicode version; to get the files matching your Python version, replace UCD/latest in the URLs with the value of unicodedata.unidata_version.
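As a rough sketch of that substitution: the URL below is an assumption about where the Unicode Character Database files live on unicode.org (the original links are not reproduced here), and `versioned_ucd_url` is a hypothetical helper name. The code only builds the URL and does not download anything.

```python
import unicodedata

# Assumed location of the "latest" UnicodeData.txt on unicode.org.
LATEST_URL = "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"

def versioned_ucd_url(latest_url: str = LATEST_URL) -> str:
    """Swap 'UCD/latest' for the Unicode version this Python runtime supports."""
    return latest_url.replace("UCD/latest", unicodedata.unidata_version)

print(unicodedata.unidata_version)  # Unicode version of this runtime
print(versioned_ucd_url())          # URL pinned to that version
```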

Per codepoint, the unicodedata.name() function can tell you the official name, and unicodedata.lookup() gives you the inverse (name to codepoint).
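A short illustration of the two functions, assuming Python 3.3 or later so that aliases are accepted by lookup(). The `"<no name>"` default is just a placeholder chosen for this sketch.

```python
import unicodedata

snowman = unicodedata.lookup("SNOWMAN")     # name -> character
print(snowman, unicodedata.name(snowman))   # character -> official name

# Control codes such as U+001B have only an alias, not an official name.
# lookup() accepts the alias, but name() raises ValueError unless a
# default value is passed as the second argument.
escape = unicodedata.lookup("ESCAPE")
print(repr(escape), unicodedata.name(escape, "<no name>"))
```

So the two functions are inverses only for characters that have an official name.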

Martijn Pieters
  • Are functions `name` and `lookup` really inverse? Indeed, `name(lookup('space'))` returns `SPACE`. But `lookup('escape')` returns expected value and `name(lookup('escape'))` raises `ValueError: no such name`. – Jeyekomon Jul 28 '22 at 09:25
  • 1
    @Jeyekomon not all Unicode codepoints have a name; `escape` is an alias instead. `lookup()` takes names and aliases (and sequences) but `name()` only ever returns the official name. It’s mostly the control codes like escape that don’t have a name. Note that `space` is an alias, names are always uppercase. Wikipedia has a [nice overview of what doesn’t have a name](https://en.wikipedia.org/wiki/Unicode_character_property#Name). – Martijn Pieters Aug 13 '22 at 11:45
5

If you want a list of all Unicode character names, consider downloading the Unicode Character Database.

It is included in the base repositories of many Linux distributions (e.g. "unicode-ucd" on RHEL).

The package includes NamesList.txt, which contains the exhaustive list of Unicode character names.

Caution: NamesList.txt can take a while to download (it is larger than 1.5 MB).

Example:

21FE    RIGHTWARDS OPEN-HEADED ARROW
21FF    LEFT RIGHT OPEN-HEADED ARROW
@@  2200    Mathematical Operators  22FF
@@+
@       Miscellaneous mathematical symbols
2200    FOR ALL
    = universal quantifier
2201    COMPLEMENT
    x (latin letter stretched c - 0297)
2202    PARTIAL DIFFERENTIAL
2203    THERE EXISTS
    = existential quantifier
2204    THERE DOES NOT EXIST
    : 2203 0338
2205    EMPTY SET
    = null set
    * used in linguistics to indicate a null morpheme or phonological "zero"
    x (latin capital letter o with stroke - 00D8)
    x (diameter sign - 2300)
    ~ 2205 FE00 zero with long diagonal stroke overlay form
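To pull just the names out of a file in this layout, something like the following may work. It is a minimal sketch based only on the excerpt above: it assumes that every character entry starts at the beginning of a line with 4 to 6 hex digits followed by whitespace and the name, and that annotation lines start with whitespace or `@`. `parse_names` and `ENTRY` are hypothetical names; the real file also contains placeholder entries such as `<control>` that you may want to filter out.

```python
import re

# An entry line: hex code point at column 0, whitespace, then the name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})\s+(\S.*)$")

def parse_names(lines):
    """Yield (code point, name) pairs from NamesList.txt-style lines."""
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match:
            yield int(match.group(1), 16), match.group(2).strip()

# A few lines copied from the excerpt above.
sample = """\
@@  2200    Mathematical Operators  22FF
2200    FOR ALL
    = universal quantifier
2204    THERE DOES NOT EXIST
    : 2203 0338
2205    EMPTY SET
"""

for code_point, name in parse_names(sample.splitlines()):
    print(f"U+{code_point:04X} {name}")
```

For the real file, pass an open file object instead of `sample.splitlines()`.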
schlebe
ToBeReplaced
2

Yes, there is a way: go through all possible code points and call unicodedata.name() on each of them, like this:

import unicodedata

names = []
for c in range(0, 0x10FFFF + 1):
    try:
        # name() expects a one-character string, not an integer
        names.append(unicodedata.name(chr(c)))
    except ValueError:
        # raised for code points that have no name
        pass
# Do something with names
nitely
  • At least in Python 3, it should be `except ValueError` instead of `except KeyError`. https://docs.python.org/3/library/unicodedata.html#unicodedata.name – Dominique Unruh Jun 02 '22 at 12:25
1

For a given codepoint, you can use unicodedata.name. To get them all, you can work through all the billions to see which have such names.

Mike Graham
  • 3
    Not billions. The standard isn't **that** big. Yet. Unicode 7.0 contains 112,804. – Martijn Pieters May 18 '15 at 12:08
  • 2
    There aren't billions of names, but there are billions of potential codepoints to work through and check if we march through naively. – Mike Graham May 18 '15 at 12:11
  • 8
    There are (and forever will be) exactly 1,114,112 potential code points. You'd have to be extremely naïve to walk the entire 32-bit space. – 一二三 May 18 '15 at 13:15
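For reference, the 1,114,112 figure is 17 planes of 65,536 code points each, which is why the loops in the other answers stop at 0x10FFFF. A quick check (the constant names are only for illustration):

```python
import sys

PLANES = 17                      # planes 0 through 16
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                       # number of potential code points
print(sys.maxunicode + 1 == total) # Python's own upper bound agrees
```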
1

Just print them all:

import unicodedata

for i in range(0x110000):
    character = chr(i)
    name = unicodedata.name(character, "")
    if name:
        print(f"{i:6} | 0x{i:04X} | {character} | {name}")
Stan
0

If you want to insert a Unicode character by name but don't know the name, here is how to get an easy overview of Unicode character names.

On Windows

  1. Open "Character Map" (search for charmap.exe and run it).
  2. Select any common Microsoft font (these tend to have a wide variety of unicode characters defined).
  3. Click on any character on the map to get its Unicode Character Name.

On Mac it's called "Character Palette" and found under System Preferences, "International -> Input" or "Language & Text -> Input Sources" by ticking the box next to "Character Palette".

Kristian L
0

my one liner, just for my own reference ;p

import unicodedata
names = [unicodedata.name(chr(c)) for c in range(0, 0x10FFFF+1) if unicodedata.name(chr(c), None)]
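A possible variation, assuming Python 3.8 or later for the `:=` operator: it calls name() only once per code point and keeps the character alongside its name, so the result can also be used for lookups. `char_by_name` is a name chosen for this sketch.

```python
import unicodedata

# Map each official name to its character; code points without a
# name return None from name() and are skipped.
char_by_name = {
    name: chr(c)
    for c in range(0x110000)
    if (name := unicodedata.name(chr(c), None)) is not None
}

print(len(char_by_name))        # depends on the Unicode version in use
print(char_by_name["SNOWMAN"])
```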
pna