3

There are a few ways to get the list of all Unicode character names: for example, with the Python module `unicodedata`, as explained in "List of unicode character names", or from the website https://unicode.org/charts/charindex.html. The latter is incomplete, though, and you have to open and parse PDFs to find the names.
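For reference, the `unicodedata` route can be sketched roughly like this. The helper name `iter_names` is made up for illustration, and the names it returns come from the Unicode version bundled with the Python build (see `unicodedata.unidata_version`), not necessarily the latest one:

```python
import sys
import unicodedata


def iter_names(start=0, stop=sys.maxunicode + 1):
    """Yield (code point, name) for every code point that has a name.

    `iter_names` is a hypothetical helper, not part of the standard library.
    """
    for cp in range(start, stop):
        # The second argument is a default returned for unnamed code points
        # (controls, surrogates, unassigned), instead of raising ValueError.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield cp, name


# Print a small slice in "code   character name" form.
for cp, name in iter_names(0x0102, 0x0104):
    print(f"{cp:04X}   {name}")
```

Calling `iter_names()` without arguments walks the whole code space, which works but ties the result to whatever Unicode version that Python ships with.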

But what is the official source/repository of all Unicode character names? I'm looking for the original source of these names, in a machine-readable format, one that is updated whenever a new character is added.

I'm looking for a list with just code point and name, in CSV or any other format:

code   character name
...
0102   LATIN CAPITAL LETTER A WITH BREVE
0103   LATIN SMALL LETTER A WITH BREVE
...
Basj
  • What has this to do with "python", "string" and "utf-8"? – AmigoJack Dec 05 '20 at 16:26
  • @AmigoJack I initially wanted to use `unicodedata` https://docs.python.org/3/library/unicodedata.html, as mentioned in the question, but you're right this aspect is secondary. – Basj Dec 05 '20 at 16:28
  • How about editing your question so `unicodedata` links to Python (because it can mean [something different](http://www.unicode.org/L2/L1999/UnicodeData.html)) and removing the other two tags? I came here for "utf-8" just to find out the encoding is nowhere involved. – AmigoJack Dec 05 '20 at 16:33

1 Answer

6

The official source for the actual character data (which includes the character names and many, many other details) is the Unicode Character Database.

The latest version of the data files can be accessed via http://www.unicode.org/Public/UCD/latest/.

Names specifically can be found in the file NamesList.txt. The format of that file is described here.

This is the list in a semicolon-separated format, with the code point in the first field and the name in the second: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
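A minimal sketch of turning that file into a plain "code point, name" list, assuming the first two semicolon-separated fields are the code point and the name. The names `parse_unicode_data`, `fetch_names` and `SAMPLE` are made up for this example, and skipping the placeholder entries whose name field starts with `<` (such as `<control>` or the `First`/`Last` range markers) is a choice made here, not something the file format requires:

```python
import urllib.request

UCD_URL = "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt"


def parse_unicode_data(lines):
    """Yield (code point, name) pairs from UnicodeData.txt-style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code, name = fields[0], fields[1]
        # Entries like <control> or <..., First> / <..., Last> are
        # placeholders rather than character names; this sketch skips them.
        if name.startswith("<"):
            continue
        yield int(code, 16), name


def fetch_names(url=UCD_URL):
    """Download the file and return {code point: name} (needs network access)."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return dict(parse_unicode_data(text.splitlines()))


# A few lines in the file's format, used here so the example runs offline.
SAMPLE = [
    "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
    "0102;LATIN CAPITAL LETTER A WITH BREVE;Lu;0;L;0041 0306;;;;N;LATIN CAPITAL LETTER A BREVE;;;0103;",
    "0103;LATIN SMALL LETTER A WITH BREVE;Ll;0;L;0061 0306;;;;N;LATIN SMALL LETTER A BREVE;;0102;;0102",
]

for cp, name in parse_unicode_data(SAMPLE):
    print(f"{cp:04X}   {name}")
```

For the real data, `fetch_names()` would return the full mapping. Characters covered only by a `First`/`Last` range pair get no individual entry in this sketch.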

Basj
Joachim Sauer
  • The official names are in `UnicodeData.txt`, which is much easier to parse. OTOH, your files contain other names (from `NameAliases.txt`), which are all "official" and in the same namespace. – Giacomo Catenazzi Dec 07 '20 at 09:23
  • This CSV file contains 34627 lines. Yet Wikipedia claims there are 144697 characters in Unicode. This is also backed by the official page: https://www.unicode.org/versions/stats/charcountv14_0.html – Ginden Oct 12 '21 at 16:47
  • @Ginden UnicodeData.txt doesn't include the "Unihan" CJK data for Chinese, Japanese & Korean characters. This is deployed separately in [Unihan.zip](https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip). – moon Mar 21 '22 at 10:40
  • @moon This explains the discrepancy, but there are **39493** characters in "Alphabetics, Symbols". `39493-34627 = 4866` – Ginden Mar 22 '22 at 00:00