Why are codepoints in the block CJK UNIFIED IDEOGRAPHS EXTENSION B not named according to the group pattern?

Question

In the Java standard library, Character.getName(0x2000A) returns "CJK UNIFIED IDEOGRAPHS EXTENSION B 2000A" (in Java 11, 16 and 17, which use Unicode 10 and Unicode 13), whereas I expected "CJK UNIFIED IDEOGRAPH-2000A".

The result surprised me, because the code point is part of a character group with the name pattern "CJK UNIFIED IDEOGRAPH-#", and code points in such groups usually derive their name from the group name with the # replaced by the code point number. This is the case, for example, for code point U+FA21, which returns the name "CJK COMPATIBILITY IDEOGRAPH-FA21".

This rule is explained in Unicode® Standard Annex #42, section 4.4.2:

if a code point has the attribute na (either directly or by inheritence from an enclosing group), then occurrences of the character # in the name are to be interpreted as the value of the code point.
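To make the quoted rule concrete, here is a minimal sketch of the substitution it describes. The class and method names (GroupNamePattern, expand) are made up for illustration and are not part of the JDK or of any Unicode tooling; the sketch also assumes the code point is written in upper-case hexadecimal with at least four digits, which matches the names quoted in this question.

```java
import java.util.Locale;

// Hypothetical helper, not a JDK API: expands a group name pattern
// such as "CJK UNIFIED IDEOGRAPH-#" for one code point.
final class GroupNamePattern {

    // Replaces every '#' in the pattern with the code point value,
    // written as upper-case hex padded to at least four digits (assumption).
    public static String expand(String pattern, int codePoint) {
        String hex = String.format(Locale.ROOT, "%04X", codePoint);
        return pattern.replace("#", hex);
    }

    public static void main(String[] args) {
        // The name I expected for U+2000A.
        System.out.println(expand("CJK UNIFIED IDEOGRAPH-#", 0x2000A));
        // The name Java actually returns for U+FA21.
        System.out.println(expand("CJK COMPATIBILITY IDEOGRAPH-#", 0xFA21));
    }
}
```

Running this prints "CJK UNIFIED IDEOGRAPH-2000A" and "CJK COMPATIBILITY IDEOGRAPH-FA21".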

It seems the character is instead named through the "category rule" of the JDK, where the name is formed by taking the name of the block that contains the code point, in upper case, and appending the code point in hexadecimal.
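The block-based naming can be reproduced by hand, which makes it easier to compare with what Character.getName returns. The sketch below follows the expression given in the Character.getName Javadoc for characters without an assigned name; the class and method names (BlockNameFallback, fallbackName) are invented for this example, and whether the JDK really takes this path for U+2000A is my assumption, not something I have verified in the JDK sources.

```java
import java.util.Locale;

// Hypothetical helper that mirrors the fallback expression documented
// for Character.getName; it is not itself a JDK API.
final class BlockNameFallback {

    // Builds "<BLOCK NAME> <HEX CODE POINT>", or returns null when the
    // code point does not belong to any Unicode block.
    public static String fallbackName(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == null) {
            return null;
        }
        return block.toString().replace('_', ' ')
                + " "
                + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        int codePoint = 0x2000A;
        // Name built from the block, independent of the name table.
        System.out.println(fallbackName(codePoint));
        // Name reported by the JDK, for comparison.
        System.out.println(Character.getName(codePoint));
    }
}
```

If both lines print the same text for U+2000A, that is consistent with the name coming from the block rather than from the group pattern.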

Why does the JDK return the block-based name for code point U+2000A, apparently treating it differently from U+FA21?

Unicode group and block information is taken from https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.grouped.zip — Martijn, Sep 01 '21 at 16:44
What Java version are you asking about so we can pin down the unicode version targeted? — ekrich, Sep 01 '21 at 16:55
So that means the included version is Unicode 10 - https://docs.oracle.com/en/java/javase/11/intl/internationalization-enhancements1.html — ekrich, Sep 01 '21 at 17:05
Same with Java 16, which uses Unicode 13, the latest version (and the version of the linked character information file) — Martijn, Sep 01 '21 at 17:05
@user16320675 u+2000A should have a name: I clarified the question with the annex where it's defined. It is indeed in the block `CJK UNIFIED IDEOGRAPHS EXTENSION B`, but its name is coming from the group it's defined in. — Martijn, Sep 01 '21 at 17:35
The documentation here seems to explain that the name is the block name plus the code point. https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#getName-int- — ekrich, Sep 01 '21 at 17:44
it does have a name: see unicode.org/Public/UCD/latest/ucdxml/ucd.all.grouped.zip -- the group is defined on line 79572 with property `na="CJK UNIFIED IDEOGRAPH-#"` and the codepoint itself a few lines below it, on line 79583. — Martijn, Sep 01 '21 at 17:46
@user16320675 no, the name is on the group above it. I cite the annex that defines the pattern in the question: "if a code point has the attribute na (either directly or by inheritence from an enclosing group), then occurrences of the character # in the name are to be interpreted as the value of the code point.". The "na" attribute is "CJK UNIFIED IDEOGRAPH-#", so the character's name is "CJK UNIFIED IDEOGRAPH-2000A" — Martijn, Sep 01 '21 at 17:53
If they did that, that would be a bug. But it would be good to know if that's actually what they did, or that they followed some other procedure to arrive at the names they use. — Martijn, Sep 01 '21 at 18:10
Yes, the Java implementation deviates from the Unicode data file and spec (see also https://www.unicode.org/versions/Unicode13.0.0/ch04.pdf page 182 NR2 and table 4.2). This question is asking *why* that is. If you think it's because of a bug in the Java implementation, that's a fine answer, but not really as a comment. — Martijn, Sep 01 '21 at 19:50
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236705/discussion-between-martijn-and-user16320675). — Martijn, Sep 02 '21 at 20:40
