In the Java standard library, `Character.getName(0x2000A)` returns "CJK UNIFIED IDEOGRAPHS EXTENSION B 2000A" (in Java 11, 16 and 17, which use Unicode versions 10 and 13), while I expected "CJK UNIFIED IDEOGRAPH-2000A".

The result surprised me, because the code point is part of a character group with the name pattern "CJK UNIFIED IDEOGRAPH-#", and code points in such groups usually derive their name from the group name with the # replaced by the code point number. This is the case, for example, for code point U+FA21, for which the returned name is "CJK COMPATIBILITY IDEOGRAPH-FA21".
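
A minimal way to reproduce the comparison on a given JDK, as a sketch: the class name `GetNameRepro` and the helper `describe` are made up for illustration, and only `Character.getName` and `String.format` come from the standard library. The output for U+2000A is deliberately not stated in the code, since it is the behaviour in question.

```java
class GetNameRepro {
    // Formats a code point as "U+XXXX -> <name reported by this JDK>".
    static String describe(int codePoint) {
        return String.format("U+%04X -> %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The result may depend on the JDK and its bundled Unicode version.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println(describe(0xFA21));
        System.out.println(describe(0x2000A));
    }
}
```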

This rule is explained in Unicode® Standard Annex #42, section 4.4.2:

> if a code point has the attribute na (either directly or by inheritence from an enclosing group), then occurrences of the character # in the name are to be interpreted as the value of the code point.
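
The quoted rule can be sketched as a simple substitution. This is an illustration only: `NamePattern.expand` is a hypothetical helper, and formatting the code point as uppercase hexadecimal with at least four digits is an assumption based on how the names are usually written, not something the quoted sentence spells out.

```java
import java.util.Locale;

class NamePattern {
    // Replaces every '#' in a ucdxml "na" pattern with the code point,
    // written as uppercase hex (assumed format, minimum four digits).
    static String expand(String naPattern, int codePoint) {
        String hex = String.format(Locale.ROOT, "%04X", codePoint);
        return naPattern.replace("#", hex);
    }

    public static void main(String[] args) {
        // Group pattern as cited from ucd.all.grouped in the comments.
        System.out.println(expand("CJK UNIFIED IDEOGRAPH-#", 0x2000A));
        System.out.println(expand("CJK COMPATIBILITY IDEOGRAPH-#", 0xFA21));
    }
}
```

Under these assumptions, U+2000A expands to "CJK UNIFIED IDEOGRAPH-2000A", which is the name the question expects.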

It seems the character is named through the "category rule" of the JDK, where the name is formed by taking the name of the code point's block in upper case and appending the code point in hexadecimal.
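
To check whether the returned string matches that block-based fallback, the fallback can be evaluated directly. The sketch below follows the expression given in the `Character.getName` Javadoc for code points that have no name in the UnicodeData file, as far as I understand it; `blockFallbackName` is a hypothetical helper, not a JDK method, and the choice of `Locale.ROOT` is an assumption.

```java
import java.util.Locale;

class BlockFallback {
    // Builds "<BLOCK NAME> <HEX CODE POINT>" from the code point's Unicode block,
    // or returns null if the code point is not in any block known to this JDK.
    static String blockFallbackName(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == null) {
            return null;
        }
        return block.toString().replace('_', ' ') + " "
                + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x2000A, 0xFA21}) {
            String fallback = blockFallbackName(cp);
            // Compare the block-based string with what getName reports on this JDK.
            System.out.println(fallback + " | getName: " + Character.getName(cp)
                    + " | same: " + fallback.equals(Character.getName(cp)));
        }
    }
}
```

If the two strings are equal for U+2000A but differ for U+FA21, that would be consistent with the JDK taking the fallback path only for the former.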

Why does the JDK return the block name for code point U+2000A, apparently treating it differently from U+FA21?

Martijn
  • 1
    unicode group and block information taken from https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.grouped.zip – Martijn Sep 01 '21 at 16:44
  • What Java version are you asking about so we can pin down the unicode version targeted? – ekrich Sep 01 '21 at 16:55
  • AdoptOpenJDK 11.0.11 @ekrich – Martijn Sep 01 '21 at 16:57
  • So that means the included version is Unicode 10 - https://docs.oracle.com/en/java/javase/11/intl/internationalization-enhancements1.html – ekrich Sep 01 '21 at 17:05
  • Same with Java 16, which uses Unicode 13, the latest version (and the version of the linked character information file) – Martijn Sep 01 '21 at 17:05
  • I edited the question to clarify – Martijn Sep 01 '21 at 17:07
  • @user16320675 u+2000A should have a name: I clarified the question with the annex where it's defined. It is indeed in the block `CJK UNIFIED IDEOGRAPHS EXTENSION B`, but its name is coming from the group it's defined in. – Martijn Sep 01 '21 at 17:35
  • It seems to explain here that it is the block plus the code point. https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#getName-int- – ekrich Sep 01 '21 at 17:44
  • it does have a name: see unicode.org/Public/UCD/latest/ucdxml/ucd.all.grouped.zip -- the group is defined on line 79572 with property `na="CJK UNIFIED IDEOGRAPH-#"` and the codepoint itself a few lines below it, on line 79583. – Martijn Sep 01 '21 at 17:46
  • @user16320675 no, the name is on the group above it. I cite the annex that defines the pattern in the question: "if a code point has the attribute na (either directly or by inheritence from an enclosing group), then occurrences of the character # in the name are to be interpreted as the value of the code point." The "na" attribute is "CJK UNIFIED IDEOGRAPH-#", so the character's name is "CJK UNIFIED IDEOGRAPH-2000A" – Martijn Sep 01 '21 at 17:53
  • If they did that, that would be a bug. But it would be good to know if that's actually what they did, or that they followed some other procedure to arrive at the names they use. – Martijn Sep 01 '21 at 18:10
  • Yes, the Java implementation deviates from the Unicode data file and spec (see also https://www.unicode.org/versions/Unicode13.0.0/ch04.pdf page 182 NR2 and table 4.2). This question is asking *why* that is. If you think it's because of a bug in the Java implementation, that's a fine answer, but not really as a comment. – Martijn Sep 01 '21 at 19:50

0 Answers