How can I tell if a Unicode code point is one complete printable glyph (or grapheme cluster)?

Question

Let's say there's a Unicode String object, and I want to print each Unicode character in that String one by one. In my simple test with a very limited set of languages, I could successfully achieve this just by assuming one code point is always the same as one glyph.

But I know this is not always the case, and that assumption may easily cause unexpected results in some countries or languages.

So my question is, is there any way to tell if one Unicode code point is one complete printable glyph in Java or C#? If I have to write code in C/C++, that's fine too.

I googled for hours, but all I found was about code units and code points. It's very easy to tell if a code unit is part of a surrogate pair, but I found nothing about graphemes.

Could anyone point me in the right direction, please?

If using ICU, you want `BreakIterator::createCharacterInstance()` (C++ or Java) — Tavian Barnes, Aug 23 '18 at 22:08
@Tavian Barnes Thanks for the quick comment! If I'm android programming, can I simply access it without including any libraries? I heard android uses ICU internally. — Jenix, Aug 23 '18 at 22:12
Actually apparently it's part of the Java SDK too so just use https://developer.android.com/reference/java/text/BreakIterator.html#getCharacterInstance(java.util.Locale) — Tavian Barnes, Aug 23 '18 at 22:13

score 3 · Accepted Answer · answered Aug 24 '18 at 14:30

You're definitely right that a single glyph is often composed of more than one code point. For example, the letter é (e with acute accent) may be equivalently written \u00E9 or with a combining accent as \u0065\u0301. Unicode normalization cannot always merge things like this into one code point, especially if there are multiple combining characters. So you'll need to use some Unicode segmentation rules to identify the boundaries you want.
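The normalization point can be checked with a small sketch using the standard `java.text.Normalizer`. The class and method names here (`NormalizationDemo`, `codePointsAfterNfc`) are made up for illustration, and the second input assumes that Unicode has no precomposed "q with acute" character, which is true as far as I know:

```java
import java.text.Normalizer;

public class NormalizationDemo {
    /** Counts the code points that remain after NFC normalization. */
    static int codePointsAfterNfc(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // e + combining acute has a precomposed form, so NFC merges it.
        System.out.println(codePointsAfterNfc("e\u0301"));
        // q + combining acute has no precomposed form (assumption noted above),
        // so NFC leaves both code points in place.
        System.out.println(codePointsAfterNfc("q\u0301"));
    }
}
```

Both inputs render as a single accented letter, which is why counting code points, even after normalization, is not a reliable way to count user-perceived characters.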

What you are calling a "printable glyph" is called a user-perceived character or (extended) grapheme cluster. In Java, the way to iterate over these is with BreakIterator.getCharacterInstance(Locale):

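import java.text.BreakIterator;
import java.util.Locale;

// Locale.ROOT is a locale-neutral stand-in; pass the user's locale for UI text.
BreakIterator boundary = BreakIterator.getCharacterInstance(Locale.ROOT);
boundary.setText(yourString);
for (int start = boundary.first(), end = boundary.next();
        end != BreakIterator.DONE;
        start = end, end = boundary.next()) {
    // Each [start, end) range covers one user-perceived character.
    String chunk = yourString.substring(start, end);
}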
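For comparison, here is a self-contained sketch that contrasts code units, code points, and grapheme clusters for the same string. The `GraphemeSplitter` class and its `split` helper are hypothetical names, not part of any library, and `Locale.ROOT` is just one reasonable choice:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GraphemeSplitter {
    /** Splits text into user-perceived characters using the JDK's BreakIterator. */
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // "e" + combining acute, one emoji stored as a surrogate pair, then "z".
        String s = "e\u0301\uD83D\uDE00z";
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("graphemes:   " + split(s, Locale.ROOT).size());
    }
}
```

The three counts differ for this input (5 code units, 4 code points, 3 grapheme clusters). How longer emoji sequences, such as those joined with zero-width joiners, are grouped can depend on the JDK version, so it is worth testing on the runtime you target.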

answered Aug 24 '18 at 14:30

Tavian Barnes


Thanks for the detailed answer :) This works great in my test. But what happens if BreakIterator.getCharacterInstance() method takes Locale.JAPAN but my text has Chinese, Japanese, and Korean characters? Setting a specific Locale here feels like I'm again working with ANSI code page things, which I hated so much... :( – Jenix Aug 24 '18 at 17:59
@Jenix People in different locales will answer the question "how many characters is this" differently for the same string. See http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries for some examples. For UI purposes it probably makes sense to use the user's current locale. If you need consistency between different locales it might make sense to use `Locale.ROOT`. – Tavian Barnes Aug 24 '18 at 18:07
As you said, for UIs the default Locale would make sense, but my case is something like a web browser or a text viewer/editor, so there's a chance the default locale is not appropriate. Anyway, I need to read that page. Thanks for the link :) – Jenix Aug 24 '18 at 18:15
