44

I'm new to Unicode and not sure how much of my ASCII background carries over. I'm reading the C# spec's rules for identifiers to determine which characters are permitted in Azure Table (whose rules are based directly on the C# spec).

Where can I find a list of Unicode characters that fall into these categories:

  • letter-character: A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
  • combining-character: A Unicode character of classes Mn or Mc
  • decimal-digit-character: A Unicode character of the class Nd
  • connecting-character: A Unicode character of the class Pc
  • formatting-character: A Unicode character of the class Cf
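For reference, the five classes above can be tested in .NET with `char.GetUnicodeCategory`. The following is only a minimal sketch: the `IdentifierChars` class and its `IsInListedClass` method are hypothetical names made up for illustration, and the method only checks membership in the listed classes. It does not implement the spec's rules about which class may appear at which position in an identifier, and it does not handle characters outside the BMP.

```csharp
using System;
using System.Globalization;

public static class IdentifierChars
{
    // Hypothetical helper: true if c falls into any class listed in the question.
    public static bool IsInListedClass(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:      // Lu
            case UnicodeCategory.LowercaseLetter:      // Ll
            case UnicodeCategory.TitlecaseLetter:      // Lt
            case UnicodeCategory.ModifierLetter:       // Lm
            case UnicodeCategory.OtherLetter:          // Lo
            case UnicodeCategory.LetterNumber:         // Nl
            case UnicodeCategory.NonSpacingMark:       // Mn
            case UnicodeCategory.SpacingCombiningMark: // Mc
            case UnicodeCategory.DecimalDigitNumber:   // Nd
            case UnicodeCategory.ConnectorPunctuation: // Pc
            case UnicodeCategory.Format:               // Cf
                return true;
            default:
                return false;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Print code point, .NET category, and the membership result for a few samples.
        foreach (char c in new[] { 'a', 'Z', '7', '_', '#', ' ' })
        {
            Console.WriteLine(
                $"U+{(int)c:X4} {char.GetUnicodeCategory(c)} {IdentifierChars.IsInListedClass(c)}");
        }
    }
}
```

A `switch` over the enum keeps the two-letter code next to each .NET name, which makes it easy to compare against the spec's wording.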
dtb
makerofthings7
  • @Hans Passant, that regex is just for a table name, not for an identifier that is used in property names such as PartitionKey and RowKey – makerofthings7 Sep 18 '10 at 16:58
  • @Hans: Very wrong comment. A-Za-z covers 52 Unicode characters out of thousands that are permissible letters. – Timwi Sep 18 '10 at 17:04
  • @Timwi - oh, I did not know that. I quoted from the docs of course. – Hans Passant Sep 18 '10 at 17:25
  • @Hans: What docs? Link? The C# language specification clearly states “A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl; or A *unicode-escape-sequence* representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl” (§2.4.2 Identifiers). MakerOfThings7 even linked to this in the question. – Timwi Sep 18 '10 at 23:52
  • @Timwi - the OP knew what I meant. Good enough for me. – Hans Passant Sep 19 '10 at 01:03
  • Timwi: Looks like @HansPassant was misreading the docs linked by the OP, and misunderstood the "table names" section in the [Azure Table docs linked by the OP](https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/Understanding-the-Table-Service-Data-Model?redirectedfrom=MSDN) (which indeed is restricted to those few characters) as being relevant. Of course, the question is not about table names. – ShreevatsaR Nov 24 '16 at 19:53

5 Answers

46

You can retrieve this information in an automated fashion from the official Unicode data file, UnicodeData.txt, which the Unicode Consortium publishes as part of the Unicode Character Database.

This is a file with semicolon-separated values in each line. The third column tells you the character class of each character.

The benefit of this is that you can get the character name for each character, so you have a better idea of what it is than by just looking at the character itself (e.g. would you know what ბ is? That’s right, it’s Ban. In Georgian. :-))
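As a rough illustration of reading that format, here is a minimal sketch that counts characters per category from lines of UnicodeData.txt. The class name `UnicodeDataSketch` and method `CountByCategory` are hypothetical, and the sample lines in `Main` are a small hand-written excerpt in the file's format rather than the real file. One thing to be aware of: the file lists some large blocks (such as the CJK ideographs) only as a `First`/`Last` pair of lines, so the sketch expands those pairs into a range size instead of counting them as two characters. The end of the CJK range varies by Unicode version; `9FEF` is only an example value.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class UnicodeDataSketch
{
    // Counts code points per general category (the third field of each line).
    public static Dictionary<string, int> CountByCategory(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        int rangeStart = -1; // code point of a pending "<..., First>" line, or -1

        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            string[] fields = line.Split(';');
            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber,
                                      CultureInfo.InvariantCulture);
            string name = fields[1];
            string category = fields[2];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint; // wait for the matching Last line
                continue;
            }

            int size = 1;
            if (rangeStart >= 0 && name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                size = codePoint - rangeStart + 1; // whole range, inclusive
                rangeStart = -1;
            }

            counts.TryGetValue(category, out int current); // current is 0 if absent
            counts[category] = current + size;
        }
        return counts;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hand-written sample in the file's format (not the full file).
        string[] sample =
        {
            "0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;",
            "0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;",
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
            "9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
        };

        foreach (var pair in UnicodeDataSketch.CountByCategory(sample))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

To run it against the real file, you could pass `File.ReadLines(path)` for a local copy instead of the sample array.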

Timwi
  • Nice! I can even search for chars within each category like this ";Cf;" – makerofthings7 Sep 19 '10 at 02:53
  • ...I never in my life thought Unicode was this complex. Seems like I have a lot of learning to do. – makerofthings7 Sep 19 '10 at 02:54
  • This is weird. The text file says the "#" sign is in the "Sc" category, and MSDN says [the same](https://msdn.microsoft.com/en-us/library/system.globalization.unicodecategory(v=vs.110).aspx), but it is in fact "OtherPunctuation", i.e. "Ps". Bug in .NET 4.5.1? – Marcus Apr 06 '16 at 10:09
  • @Marcus: I think you were looking at the wrong line. It says Po for the # character, and Sc for the next one, the dollar sign. – Timwi Apr 06 '16 at 13:41
  • @Timwi ok I might have but my debugger still says # is Ps, not Po or any other category. Having # in such a large group as "OtherPunctuation" seems like a bug in .net to me. – Marcus Apr 07 '16 at 08:02
  • @Timwi https://en.wikipedia.org/wiki/Template:General_Category_(Unicode) says the "Lo" category contains 121,212 characters, but the file you linked lists only 16,053 characters under "Lo". Is there a reason? – oliver smith Jan 02 '19 at 07:33
  • @oliversmith The file I linked to does not list all of the Han (Chinese) characters separately, but as a range: `4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;` / `9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;` – Timwi Jan 20 '19 at 15:53
  • The description of the file format is here (including the list of character ranges represented by only the start and end characters): http://www.unicode.org/L2/L1999/UnicodeData.html – Sandra Rossi Dec 23 '20 at 16:46
38

FileFormat.info has a list of Unicode characters by category:

http://www.fileformat.info/info/unicode/category/index.htm

Phil Ross
  • That site doesn't parse UnicodeData.txt right. It doesn't recognize ranges and doesn't understand Cn. So the categories Co, Cs, Lo, and Cn have the wrong counts. Other than that it's a cool site. – Yuvi Masory May 10 '11 at 03:52
16

You can, of course, use LINQ:

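using System;
using System.Globalization;
using System.Linq;

// Group every code point by its Unicode category. The surrogate range
// U+D800..U+DFFF is skipped because char.ConvertFromUtf32 rejects it.
var charInfo = Enumerable.Range(0, 0x110000)
                         .Where(x => x < 0x00d800 || x > 0x00dfff)
                         .Select(char.ConvertFromUtf32)
                         .GroupBy(s => char.GetUnicodeCategory(s, 0))
                         .ToDictionary(g => g.Key);

// Print every character in category Ll (lowercase letters).
foreach (var ch in charInfo[UnicodeCategory.LowercaseLetter])
{
    Console.Write(ch);
}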

You can find a list of Unicode categories and their short names on MSDN, e.g., "Ll" is short for UnicodeCategory.LowercaseLetter.
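If you would rather look characters up by the two-letter code used in the spec, a lookup table along these lines might help. This is a sketch only: `CategoryCodes`, `ByCode`, and `BmpCharsIn` are hypothetical names, not part of .NET, and the helper is limited to the BMP (U+0000 through U+FFFF) to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CategoryCodes
{
    // Two-letter Unicode abbreviations mapped to the .NET enum members.
    public static readonly Dictionary<string, UnicodeCategory> ByCode =
        new Dictionary<string, UnicodeCategory>
        {
            { "Lu", UnicodeCategory.UppercaseLetter },
            { "Ll", UnicodeCategory.LowercaseLetter },
            { "Lt", UnicodeCategory.TitlecaseLetter },
            { "Lm", UnicodeCategory.ModifierLetter },
            { "Lo", UnicodeCategory.OtherLetter },
            { "Mn", UnicodeCategory.NonSpacingMark },
            { "Mc", UnicodeCategory.SpacingCombiningMark },
            { "Me", UnicodeCategory.EnclosingMark },
            { "Nd", UnicodeCategory.DecimalDigitNumber },
            { "Nl", UnicodeCategory.LetterNumber },
            { "No", UnicodeCategory.OtherNumber },
            { "Zs", UnicodeCategory.SpaceSeparator },
            { "Zl", UnicodeCategory.LineSeparator },
            { "Zp", UnicodeCategory.ParagraphSeparator },
            { "Cc", UnicodeCategory.Control },
            { "Cf", UnicodeCategory.Format },
            { "Cs", UnicodeCategory.Surrogate },
            { "Co", UnicodeCategory.PrivateUse },
            { "Pc", UnicodeCategory.ConnectorPunctuation },
            { "Pd", UnicodeCategory.DashPunctuation },
            { "Ps", UnicodeCategory.OpenPunctuation },
            { "Pe", UnicodeCategory.ClosePunctuation },
            { "Pi", UnicodeCategory.InitialQuotePunctuation },
            { "Pf", UnicodeCategory.FinalQuotePunctuation },
            { "Po", UnicodeCategory.OtherPunctuation },
            { "Sm", UnicodeCategory.MathSymbol },
            { "Sc", UnicodeCategory.CurrencySymbol },
            { "Sk", UnicodeCategory.ModifierSymbol },
            { "So", UnicodeCategory.OtherSymbol },
            { "Cn", UnicodeCategory.OtherNotAssigned },
        };

    // Hypothetical helper: all BMP characters in the category with the given code,
    // in ascending code point order.
    public static IEnumerable<char> BmpCharsIn(string code)
    {
        UnicodeCategory wanted = ByCode[code];
        return Enumerable.Range(0, 0x10000)
                         .Select(i => (char)i)
                         .Where(c => char.GetUnicodeCategory(c) == wanted);
    }
}

public static class Program
{
    public static void Main()
    {
        // List the connecting characters (Pc) as code points.
        foreach (char c in CategoryCodes.BmpCharsIn("Pc"))
        {
            Console.WriteLine($"U+{(int)c:X4}");
        }
    }
}
```

The exact characters returned depend on the Unicode data shipped with your .NET version, so treat the output as a snapshot rather than a definitive list.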

dtb
  • How did you know to hard code those constants in? Where do they come from? – makerofthings7 Sep 18 '10 at 17:40
  • @MakerOfThings7: From the documentation of [Char.ConvertFromUtf32](http://msdn.microsoft.com/en-us/library/system.char.convertfromutf32.aspx). It throws an exception if its argument "is not a valid 21-bit Unicode code point ranging from U+0 through U+10FFFF, excluding the surrogate pair range from U+D800 through U+DFFF." – dtb Sep 18 '10 at 17:44
  • LINQ is fun. +1 since I'm going to learn something from this. Also I think not all chars will render within "Console.Write". Perhaps it's better for me to output these codes in an HTML page for IE to render? – makerofthings7 Sep 19 '10 at 02:56
  • @MakerOfThings7: Yes, the set of characters the Console can display is quite limited. Writing the characters to an HTML page is a good idea. – dtb Sep 19 '10 at 15:01
  • I'm convinced that char.GetUnicodeCategory gives incorrect results. U+0E33 should (imo) give me the result SpacingCombiningMark, but it returns 'OtherLetter'. This does not seem right to me. – Gusdor Nov 08 '12 at 15:07
  • @Gusdor The Unicode Standard says it is of type `OtherLetter` for sentence. You are perhaps looking for Grapheme Cluster Break, which I am not sure is available in C#. – NetMage Jan 08 '21 at 21:35
2

In the ANTLR lexer you can find Unicode character sets (LU, LL, LT, LM, and LO) in convenient range format.

Peter Mortensen
Ivan Kochurkin
2

https://www.compart.com/en/unicode/category is a useful, easy-to-navigate site for browsing the categories. It is searchable and lists quite a lot of information on individual Unicode characters.

b3000