Is there any limit to the number of distinct graphemes that can be represented with a Unicode encoding such as UTF-8? Does, for example, the Unicode standard restrict the number of consecutive combining characters?

Anthony Faull

1 Answer

2

The set of possible combinations of a character and combining marks after it is infinite (though only countably infinite ☺). The Unicode Standard says explicitly in clause 2.1 (in chapter 2): “All combining characters can be applied to any base character and can, in principle, be used with any script.” A combination of a letter and a diacritic can be used as a base character for another diacritic, and so on.
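As a rough illustration of that point, here is a minimal Python sketch that stacks an arbitrary number of combining marks on one base letter and encodes the result as UTF-8. The helper name `stack_marks` is made up for this example, and it uses a nonzero canonical combining class from the standard `unicodedata` module as a simple proxy for "is a combining mark" (some marks have class 0 and would be rejected by this check).

```python
import unicodedata


def stack_marks(base: str, mark: str, count: int) -> str:
    """Return `base` followed by `count` copies of one combining mark.

    Hypothetical helper for illustration only.
    """
    # A nonzero canonical combining class is used here as a simple proxy
    # for "combining mark"; this is an assumption of the sketch.
    if unicodedata.combining(mark) == 0:
        raise ValueError("mark must have a nonzero combining class")
    return base + mark * count


if __name__ == "__main__":
    for n in (1, 10, 1000):
        s = stack_marks("a", "\u0301", n)  # U+0301 COMBINING ACUTE ACCENT
        encoded = s.encode("utf-8")
        # The sequence encodes and decodes without error, whatever n is.
        assert encoded.decode("utf-8") == s
        print(n, len(s), len(encoded))
```

Nothing in the encoding step objects to the length of the mark sequence; the only practical limits are memory and whatever the consuming software chooses to render.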

At a higher protocol level, as in a data format specification, you can of course impose limits, e.g. on the number of consecutive combining marks. The Unicode Standard, however, does not set such restrictions.
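A higher-level restriction of that kind could be sketched as follows. The limit `MAX_MARKS = 4` and the function names are hypothetical, chosen only for this example, and "combining mark" is taken here to mean any character whose general category starts with `M` (Mn, Mc, Me); a real data format would have to define both for itself.

```python
import unicodedata

MAX_MARKS = 4  # hypothetical limit, not taken from any standard


def longest_mark_run(text: str) -> int:
    """Return the length of the longest run of consecutive combining marks."""
    longest = current = 0
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def within_limit(text: str, max_marks: int = MAX_MARKS) -> bool:
    """Check text against the format's (assumed) combining-mark limit."""
    return longest_mark_run(text) <= max_marks


if __name__ == "__main__":
    print(within_limit("e\u0301"))       # one mark on the base letter
    print(within_limit("e" + "\u0301" * 5))  # five marks, over the limit
```

Text rejected by such a check is still valid Unicode; it merely fails the rule of the particular format doing the checking.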

Jukka K. Korpela
  • 1
    Note UAX#15 defines a [stream-safe](http://www.unicode.org/reports/tr15/#Stream_Safe_Text_Format) restriction which limits the number of combining characters, at which we are back in finite land (albeit still with a staggeringly enormous number of potential graphemes). But it's certainly valid to have a Unicode string which is not ‘stream-safe’. – bobince Aug 28 '13 at 09:18
  • 1
    @deceze, I specified that it is countably infinite, which means that it has the same cardinality as the set of natural numbers. – Jukka K. Korpela Aug 28 '13 at 09:35