
I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29: what a user perceives as a character, which might be represented in Unicode as a single code point or as a multi-code-point grapheme cluster. Since I typically work with Latin-script languages, I always come up with examples like "I want to handle ü as one UPC, regardless of whether it is stored as a grapheme cluster or as a single code point".

Colleagues who are against a UPC iterator (or grapheme cluster iterator, take your pick) counter with "You can normalize to NFC and then use code point iteration" and "There is no use case for grapheme cluster iteration".

I keep thinking of Latin-centric use cases, which may not translate well to the rest of the Unicode universe. For example, I'm doing terminal output and want to pad a column to N column widths, so I want to know how many UPCs are in a string.

I think what I want to know is:

  1. Are there meaningful grapheme clusters which can't be normalized to a single code point? Are there any that are likely to occur among Western users? I'm assuming Korean or Arabic are cases of this, but I have to admit to total ignorance there.
  2. Do any other languages provide UPC/grapheme cluster iteration/operations? Is there any kind of advice from the Unicode specification?
一二三
Spacemoose

2 Answers


It's unclear how your questions are not answered by UAX #29:

  1. There are many such grapheme clusters, even for languages that only use the Latin alphabet, because not every combining mark has a precomposed form with every letter. See, for example, the gaps in this table on Wikipedia. Table 1a in UAX #29 has several non-Latin examples.

  2. This is the purpose of UAX #29: to generalise grapheme cluster operations to all languages that are supported in Unicode.
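A minimal sketch of point 1, using only Python's standard `unicodedata` module. The helper name `nfc_codepoints` and the two sample strings are illustrative choices, not taken from UAX #29: "ü" has a precomposed form, while "g̃" (g with tilde, used in Guarani) does not, so NFC leaves it as two code points.

```python
import unicodedata


def nfc_codepoints(text: str) -> int:
    """Return how many code points remain after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# u + COMBINING DIAERESIS: NFC composes this to the single code point U+00FC.
u_umlaut = "u\u0308"
# g + COMBINING TILDE: Unicode has no precomposed form, so NFC keeps both.
g_tilde = "g\u0303"

print(nfc_codepoints(u_umlaut))  # 1
print(nfc_codepoints(g_tilde))   # 2
```

Code point iteration after NFC would therefore report "g̃" as two characters, even though a reader sees one.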

一二三
  • I just reread UAX #15... Are you referring to section 5, "Composition Exclusion Table"? I have to admit I have trouble taking the content of the section and applying it to the languages I know. I suppose I am asking for cultural knowledge: how commonly will I need to be aware of grapheme clusters? Is it reasonable to tell my customers we don't support them? There's an element in my company leaning towards ignoring their presence until they bite us. I'd like to know the risks, and have compelling arguments at hand, if they exist. – Spacemoose Aug 13 '15 at 13:39
  • The Wikipedia table seems to be what I'm looking for regarding Latin-script languages. Can you or anyone else tell me how common these excluded clusters are, and in which countries I'm likely to encounter them? – Spacemoose Aug 13 '15 at 13:42
  • Given that the algorithm for supporting grapheme clusters is well-known and implemented in any decent Unicode library, *not* supporting them would seem to be more difficult. – 一二三 Aug 13 '15 at 14:18

(1) Are there any that are likely to occur among western users?

👍🏻 (thumbs up + light skin tone). Can occur: anywhere in the Northern Hemisphere on an application which has easy access to emojis.
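To make the emoji case concrete, here is a rough Python sketch using only the standard library. The function `approx_upc_count` is a hypothetical helper and a deliberate simplification, not the UAX #29 algorithm: it only treats combining marks and the skin tone modifiers U+1F3FB..U+1F3FF as extending the previous character, and it ignores Hangul jamo, ZWJ sequences, regional indicator flags and the other rules a real implementation handles.

```python
import unicodedata

# Emoji skin tone modifiers, U+1F3FB..U+1F3FF (range end is exclusive).
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F400)


def extends_previous(ch: str) -> bool:
    """True if ch is a combining mark (category M*) or a skin tone modifier."""
    return unicodedata.category(ch).startswith("M") or ord(ch) in EMOJI_MODIFIERS


def approx_upc_count(text: str) -> int:
    """Approximate count of user-perceived characters (simplified, see above)."""
    return sum(
        1 for i, ch in enumerate(text) if i == 0 or not extends_previous(ch)
    )


thumbs_up_light = "\U0001F44D\U0001F3FB"  # THUMBS UP SIGN + light skin tone

# NFC cannot merge the pair: there is no precomposed code point for it.
print(len(unicodedata.normalize("NFC", thumbs_up_light)))  # 2
print(approx_upc_count(thumbs_up_light))                   # 1
```

For production use, a library that implements UAX #29 in full is the safer choice; the sketch only shows why code point counts and perceived character counts diverge.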

(2) Do any other languages provide UPC/grapheme cluster iteration/operations?

The unicode_segmentation crate (library) for Rust.

Guildenstern