I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what a user perceives as a single character, which might be represented in Unicode as one codepoint or as a grapheme cluster. Since I typically work with Latin-script languages, I always come up with examples like "I want to handle ü as one UPC, regardless of whether it is stored as a grapheme cluster or as a single codepoint".
Colleagues who are against a UPC iterator (or grapheme cluster iterator, take your pick) counter with "You can normalize to NFC and then use codepoint iteration" and "there is no use case for grapheme cluster iteration".
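To be fair, for the ü case their suggestion does work. A quick check with Python's stdlib `unicodedata` (my choice of language here purely for illustration):

```python
import unicodedata

# "u" followed by U+0308 COMBINING DIAERESIS: two codepoints, one UPC
decomposed = "u\u0308"
assert len(decomposed) == 2

# NFC composes the pair into the single precomposed codepoint U+00FC (ü)
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00fc"
assert len(composed) == 1
```

So where a precomposed codepoint exists, normalize-then-iterate-codepoints does yield one unit per UPC.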
I keep thinking of Latin-centric use cases, which maybe don't translate well to the wider Unicode universe -- e.g. I'm doing terminal output and want to pad a column to N character cells, so I need to know how many UPCs are in a string...
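A sketch of how that padding use case goes wrong when you only count codepoints (again Python just for illustration; `str.ljust` counts codepoints, not UPCs):

```python
# "Tür" spelled with a decomposed ü: the user sees 3 characters,
# but the string holds 4 codepoints
cell = "Tu\u0308r"
assert len(cell) == 4

# Padding to a 6-"character" column with len()-based logic comes up
# one cell short: 6 codepoints, but only 5 visible columns on screen
padded = cell.ljust(6)
assert len(padded) == 6
```

(Full terminal alignment also has to worry about double-width East Asian characters, but even this simple case already breaks.)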
I think what I want to know is:
- Are there meaningful grapheme clusters that can't be normalized to a single codepoint? Are there any that are likely to occur among Western users? I'm assuming Korean or Arabic are cases of this, but I have to admit to total ignorance there.
- Do other programming languages or libraries provide UPC/grapheme cluster iteration and operations? Is there any advice on this in the Unicode specification itself?
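On the first bullet, a quick stdlib check suggests the answer is yes, even within Latin script: as far as I know there is no precomposed codepoint for g + combining diaeresis (g̈, which turns up in some phonetic transcriptions), so NFC has to leave it as a two-codepoint grapheme cluster:

```python
import unicodedata

# g + U+0308 COMBINING DIAERESIS: no precomposed form exists in Unicode,
# so NFC cannot collapse this grapheme cluster to one codepoint
g_diaeresis = "g\u0308"
assert unicodedata.normalize("NFC", g_diaeresis) == g_diaeresis
assert len(unicodedata.normalize("NFC", g_diaeresis)) == 2
```

Which, if I'm reading it right, means "normalize to NFC first" cannot be a complete substitute for UPC iteration in general.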