I've been puzzling over Annex C of Unicode Technical Standard #51, Unicode Standard Annex #29 (Unicode Text Segmentation), and the Unicode Grapheme Break Test data file. It appears that the definition of a grapheme cluster in Annex #29 does not cover the sequence tag_base tag_spec+ tag_end. That means an emoji built as an emoji tag sequence of seven code points will be treated as 7 graphemes by the Annex #29 algorithm rather than as a single grapheme, as one would expect.
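To make the count concrete, here is a minimal Python sketch. It assumes the England flag (U+1F3F4 followed by the tag characters for "gbeng" and CANCEL TAG) as the example sequence. The helper `make_tag_sequence` and the pattern `TAG_SEQUENCE` are hypothetical names introduced for illustration, and the pattern simplifies tag_base to "any single non-tag code point", which is looser than the definition in the standard.

```python
import re

# Tag characters mirror printable ASCII at an offset of 0xE0000.
TAG_OFFSET = 0xE0000
TAG_END = "\U000E007F"  # CANCEL TAG, used as tag_end


def make_tag_sequence(base: str, spec: str) -> str:
    """Build tag_base tag_spec+ tag_end from an ASCII spec string (hypothetical helper)."""
    return base + "".join(chr(TAG_OFFSET + ord(c)) for c in spec) + TAG_END


# Assumption: any single non-tag code point stands in for tag_base.
TAG_SEQUENCE = re.compile(
    "[^\U000E0020-\U000E007F]"   # simplified tag_base
    "[\U000E0020-\U000E007E]+"   # tag_spec+
    "\U000E007F"                 # tag_end
)

flag = make_tag_sequence("\U0001F3F4", "gbeng")  # assumed example: England flag
code_points = [f"U+{ord(ch):04X}" for ch in flag]
is_tag_sequence = TAG_SEQUENCE.fullmatch(flag) is not None

print(len(flag), code_points, is_tag_sequence)
```

A segmenter that breaks before every tag character would report one grapheme per code point here, which is where the figure of 7 comes from.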
I understand that implementations have some flexibility in precisely which sequences they render, but it seems like the correct behavior would be to treat every syntactically valid emoji tag sequence as a single grapheme for cluster analysis, rather than breaking characters built from tag sequences into multiple graphemes.
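The behavior I have in mind could be sketched roughly as below, in Python. This is not the Annex #29 algorithm: `segment_keeping_tag_sequences` is a hypothetical wrapper that keeps each syntactically valid tag sequence whole and hands everything else to whatever segmenter is supplied. It reuses the simplifying assumption that tag_base is a single non-tag code point, so it would not handle a base that is itself a multi-code-point sequence.

```python
import re
from typing import Callable, List

# Assumption: tag_base is approximated by any single non-tag code point.
TAG_SEQUENCE = re.compile(
    "[^\U000E0020-\U000E007F][\U000E0020-\U000E007E]+\U000E007F"
)


def segment_keeping_tag_sequences(
    text: str, segment: Callable[[str], List[str]]
) -> List[str]:
    """Segment text, emitting each tag sequence as one cluster (hypothetical helper)."""
    clusters: List[str] = []
    pos = 0
    for match in TAG_SEQUENCE.finditer(text):
        # Text before the tag sequence goes through the ordinary segmenter.
        clusters.extend(segment(text[pos:match.start()]))
        # The whole tag sequence is kept as a single cluster.
        clusters.append(match.group())
        pos = match.end()
    clusters.extend(segment(text[pos:]))
    return clusters


# Assumed example: the England flag, seven code points long.
flag = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"

# `list` is a stand-in segmenter that yields one cluster per code point.
naive = list("a" + flag + "b")
merged = segment_keeping_tag_sequences("a" + flag + "b", list)

print(len(naive), len(merged))
```

With the stand-in segmenter, the naive pass yields one cluster per code point, while the wrapper yields three clusters: "a", the flag, and "b".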
Edited to add:
- Is this an oversight in Annex #29 on text segmentation?
- Should an implementation of text segmentation treat such a tag sequence as a single grapheme or as seven graphemes?