
I've been puzzling over Annex C of Unicode Technical Standard #51, Unicode Standard Annex #29 (Unicode Text Segmentation), and the Unicode Grapheme Break Test data file. It appears that the definition of a grapheme cluster in Annex 29 does not cover the sequence tag_base tag_spec+ tag_end. If so, characters built as emoji tag sequences, such as the flag of Scotland, will be treated as 7 graphemes by the Annex 29 algorithm rather than as a single grapheme, as one would expect.

I understand that there is flexibility in precisely which sequences are rendered by an implementation, but it seems like the correct behavior would be to treat all instances of syntactically valid Emoji tag sequences as single graphemes for cluster analysis rather than to break up characters built from tag sequences into multiple graphemes.

Edited to add:

  1. Is this an oversight in Annex 29 on text segmentation?
  2. Should an implementation of text segmentation treat a tag sequence like the flag of Scotland as a single grapheme or as seven graphemes?
Don Hosek
  • I'm not clear what kind of answer (solution) you are seeking here. Are you asking whether there is an error/oversight/omission on the part of the authors of Technical Standard #51 with respect to emoji tag sequences? If so, I'm not sure that is on topic for SO. Or am I misunderstanding your question completely? – skomisa Mar 23 '22 at 04:29
  • It seems more a rant than a question. But are you sure? *Do not break within emoji flag sequences.* in GB12/GB13 rules – Giacomo Catenazzi Mar 23 '22 at 07:39
  • The Scotland flag is not a flag sequence though, but a *tag* sequence. If you look at the Unicode representation of the Scotland flag vs., e.g., the UK flag, they are different kinds of sequences. The UK flag is ri-U ri-K, the Scottish flag is black-flag tag-u tag-k tag-s tag-c tag-t tag-end – Don Hosek Mar 23 '22 at 09:49
  • @skomisa I guess the question is whether there is an oversight, and whether the correct behavior for grapheme clustering would be to treat something like the Scotland flag as a single grapheme or as 7 graphemes. I'll edit the question to make this clear. – Don Hosek Mar 23 '22 at 09:51
  • Write to Unicode mailing list. They included flag sequences, emoji zwj sequences, but not the tag sequences. It seems a bug. – Giacomo Catenazzi Mar 23 '22 at 10:46
  • @GiacomoCatenazzi CharlotteBuff had the answer. – Don Hosek Mar 23 '22 at 13:28
  • Some comments: [1] Part of the sequence for the Scottish flag should be _tag-g tag-b_ rather than _tag-u tag-k_. Similarly, the sequence for the UK flag is ri-G ri-B. [2] I originally read your question on Win 10, where the Scottish flag in your title was confusingly rendered as [Waving Black Flag](https://emojipedia.org/emoji/%F0%9F%8F%B4/). I later saw your question on my iPad where the Scottish flag was displayed instead, and it made more sense. [Blame Microsoft](https://stackoverflow.com/q/62729729/2985643), but be aware that flags might not always be rendered correctly for Windows users. – skomisa Mar 24 '22 at 07:19

1 Answer

The invisible tag characters belong to the grapheme cluster break category Extend, meaning they behave just like a combining mark. There doesn’t need to be a special case for handling emoji tag sequences because grapheme cluster breaking rule GB9 will simply take care of them.
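To illustrate the point, here is a minimal sketch of a toy segmenter that applies only rule GB9 (do not break before Extend) and otherwise breaks between every pair of code points. It is not an implementation of Annex 29: the helper names `is_extend` and `toy_clusters` are made up for this example, and `is_extend` only knows about the tag characters U+E0020..U+E007F rather than the full Grapheme_Cluster_Break property table.

```python
# Toy illustration of GB9 only; real UAX #29 segmentation has many more rules.

# Tag characters U+E0020..U+E007F (assumed here to be the only Extend characters).
TAG_RANGE = range(0xE0020, 0xE0080)


def is_extend(ch):
    """Hypothetical helper: treat only the tag characters as Extend."""
    return ord(ch) in TAG_RANGE


def toy_clusters(text):
    """Split text into clusters using GB9 and 'break everywhere else'."""
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch   # GB9: do not break before Extend
        else:
            clusters.append(ch)  # otherwise start a new cluster
    return clusters


# Scotland flag: WAVING BLACK FLAG + tag g, b, s, c, t + CANCEL TAG
scotland = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbsct")
    + "\U000E007F"
)

print(len(scotland))                # number of code points
print(len(toy_clusters(scotland)))  # number of clusters under the toy rule
```

Under these assumptions the seven code points collapse into one cluster, because every code point after the black flag is a tag character and therefore attaches to what precedes it.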

CharlotteBuff
  • Data is in PropList.txt as `Other_Grapheme_Extend`. TR44 derives `Grapheme_Extend` from `Other_Grapheme_Extend` (and other properties). TR29 derives `Extend` (which is not the `Extender` of TR44) from `Grapheme_Extend` (and other data). So it is really a confusing topic, and for sure difficult to navigate. – Giacomo Catenazzi Mar 23 '22 at 14:01
  • Part of my confusion, it turned out, is that there are two different kinds of flag sequences. National flags are encoded differently from regional flags. – Don Hosek Jan 10 '23 at 03:12
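
The difference between the two kinds of flag sequences discussed in the comments can be seen by listing the code points. The following is a small sketch using only Python's standard `unicodedata` module; the variable names `uk` and `scotland` and the helper `describe` are just for illustration.

```python
import unicodedata

# UK flag: two regional indicator symbols, G and B
uk = "\U0001F1EC\U0001F1E7"

# Scotland flag: black flag, tag letters g b s c t, then the cancel tag
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"


def describe(text):
    """Return (code point, character name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, seq in (("UK", uk), ("Scotland", scotland)):
    print(label)
    for cp, name in describe(seq):
        print("  ", cp, name)
```

The first sequence consists solely of regional indicator symbols, which rules GB12/GB13 deal with, while the second is a base emoji followed by tag characters.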