In Java, how are Unicode chars and Java UTF-16 codepoints handled?

Question

I'm struggling with Unicode characters in Java 10.
I'm using the java.text.BreakIterator class. For this output:

myString="a𝓞b"  hex=0061d835dcde0062
myString.length()=4 
myString.codePointCount(0,myString.length())=3
BreakIterator output:
    a    hex=0061           
    𝓞    hex=d835dcde          
    b    hex=0062

Seems correct.

Using the same Java code, then with this output:

myString="G̲íl"  hex=0047033200ed006c  
myString.length()=4 
myString.codePointCount(0,myString.length())=4
BreakIterator output:   
    G̲    hex=00470332  
    í    hex=00ed  
    l    hex=006c

Seems correct too, EXCEPT for the codePointCount=4.
Why isn't it 3, and is there a means of getting a 3 value without using BreakIterator?

My goal is to determine whether every displayed character of a string fits in a single 16-bit char, or whether surrogates or combining characters are present.

In the last example, the G_ is not displayed correctly on this page: it appears as the single character G̲, as in the defining String clause. I don't know how to correct this apparent typo. — Bcwilmot, Mar 14 '19 at 22:21

score 6 · Answer 1 · answered Mar 14 '19 at 22:23

"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.

U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires using a surrogate pair (2 UTF-16 code units) to represent U+1D4DE - but the latter is still a single code point.
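
To make the distinction concrete, here is a small sketch that reproduces the two counts from the question. The class and method names are made up for illustration; only `String.length()` and `String.codePointCount(int, int)` come from the JDK.

```java
class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // a, U+1D4DE (stored as the surrogate pair D835 DCDE), b
        String withSurrogatePair = "a\uD835\uDCDEb";
        // G, U+0332 (combining low line), U+00ED, l
        String withCombiningMark = "G\u0332\u00EDl";

        System.out.println(codeUnits(withSurrogatePair));  // 4
        System.out.println(codePoints(withSurrogatePair)); // 3
        System.out.println(codeUnits(withCombiningMark));  // 4
        System.out.println(codePoints(withCombiningMark)); // 4
    }
}
```

The surrogate pair collapses into one code point, while the combining mark stays a code point of its own, which is why the second count is 4 rather than 3.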

BreakIterator finds boundaries in text - the two code points here that are combined don't have a boundary between them in that sense. From the documentation:

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.

So I think everything is working correctly here.
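
If what you want is the number of characters as a user perceives them, counting the boundaries that BreakIterator reports is one way to get it. This is only a sketch: `countUserCharacters` is a hypothetical helper name, and the exact boundary rules can vary between JDK versions and locales. The expected values in the comments are the ones shown in the question's own output.

```java
import java.text.BreakIterator;

class GraphemeCount {
    // Counts the segments between character boundaries reported by BreakIterator.
    static int countUserCharacters(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        // After setText the iterator sits at the first boundary;
        // each successful next() moves past one user-perceived character.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countUserCharacters("a\uD835\uDCDEb")); // 3
        System.out.println(countUserCharacters("G\u0332\u00EDl")); // 3
    }
}
```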

score 1 · Answer 2 · 2019-03-14T22:43:34.393

A codepoint corresponds to one Unicode character.

Java represents Unicode in UTF-16, i.e., in 16-bit code units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. That string therefore has a length() of 4 but only 3 codepoints, which explains the first result of 3.

In the second case, G̲ is not a single Unicode character. It is one character, LATIN CAPITAL LETTER G (U+0047), followed by another character, COMBINING LOW LINE (U+0332). That is two codepoints by definition, hence the second result of 4.

In general, Unicode assigns each codepoint a set of character properties, including a general category, so it is possible to find out whether one of your codepoints is a combining character.

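Take a look at the Character class. Character.getType(codePoint) returns the general category, which tells you whether a codepoint is a combining mark (NON_SPACING_MARK, COMBINING_SPACING_MARK or ENCLOSING_MARK) or a surrogate (SURROGATE).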
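
As a rough sketch of how that check could look for the goal stated in the question, the code below scans a string by codepoint and reports whether it contains anything outside the 16-bit range or any combining mark. The class and helper names are invented for this example. Treating the three mark categories as "combining" is an assumption that covers the common cases but is not full grapheme-cluster segmentation. Note also that when you iterate by codepoint, a well-formed surrogate pair shows up as one supplementary codepoint, so `SURROGATE` is only reported for unpaired surrogates.

```java
class CodePointInspector {
    // Assumption: "combining" means general category Mn, Mc or Me.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // True if any codepoint needs a surrogate pair in UTF-16 (above U+FFFF).
    static boolean hasSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // True if any codepoint is a combining mark.
    static boolean hasCombiningMark(String s) {
        return s.codePoints().anyMatch(CodePointInspector::isCombiningMark);
    }

    // Approximate count of displayed characters: codepoints that are not marks.
    static long countNonMarkCodePoints(String s) {
        return s.codePoints().filter(cp -> !isCombiningMark(cp)).count();
    }

    public static void main(String[] args) {
        String first = "a\uD835\uDCDEb";  // contains U+1D4DE
        String second = "G\u0332\u00EDl"; // contains U+0332
        System.out.println(hasSupplementary(first));        // true
        System.out.println(hasCombiningMark(first));        // false
        System.out.println(hasSupplementary(second));       // false
        System.out.println(hasCombiningMark(second));       // true
        System.out.println(countNonMarkCodePoints(second)); // 3
    }
}
```

The last call gives the 3 asked about in the question without BreakIterator, but only as an approximation: it will miscount sequences such as emoji joined with zero-width joiners, which BreakIterator in recent JDKs is better placed to handle.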
