
I'm thinking about using UTF-16 in an application, but I have some difficulties understanding some key concepts, in particular surrogates and combining characters.

As I understand it, surrogates are used in UTF-16 to encode code points that need more than 16 bits. So if a surrogate pair is involved, my UTF-16 character needs 32 bits, i.e. two 16-bit code units.
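A minimal sketch of that in Java (whose String type stores UTF-16 code units; U+1F600, the grinning face emoji, is just one arbitrary example of a code point above U+FFFF):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 encodes it as the
        // surrogate pair D83D DE00: two 16-bit code units = 32 bits.
        String s = "\uD83D\uDE00";

        System.out.println(s.length());                              // 2 code units
        System.out.println(s.codePointCount(0, s.length()));         // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
    }
}
```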

Combining characters allow an alternative form, partly for compatibility with older encodings. So, for example, I can write the character ä also as a followed by ¨ (a sketch comparing the two forms follows the list):

  • ä: U+00E4
  • a: U+0061
  • ◌̈: U+0308 (combining diaeresis)
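A small sketch of the two forms, again in Java; java.text.Normalizer and its NFC form are standard, the rest is just illustration:

```java
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        String precomposed = "\u00E4";   // ä as a single code point, one code unit
        String decomposed  = "a\u0308";  // a + COMBINING DIAERESIS, two code units

        System.out.println(precomposed.equals(decomposed));  // false: different code units
        System.out.println(precomposed.length() + " vs " + decomposed.length());  // 1 vs 2

        // NFC normalization folds the decomposed form into the precomposed
        // code point where one exists.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(precomposed));  // true
    }
}
```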

So if I use surrogates together with combining characters, it can happen that my character needs 2 × 32 bits for encoding. This doesn't happen in my example, of course, since no surrogates are involved. But could it happen with other characters?
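It can, because combining marks are not limited to the BMP. One example that I believe works is from the Musical Symbols block: U+1D158 (MUSICAL SYMBOL NOTEHEAD BLACK) followed by the combining mark U+1D165 (MUSICAL SYMBOL COMBINING STEM). A sketch, with the same Java assumptions as above:

```java
public class SupplementaryCombiningDemo {
    public static void main(String[] args) {
        // Both code points lie outside the BMP, so each one becomes a
        // surrogate pair in UTF-16: two pairs = 4 code units = 64 bits.
        String note = new StringBuilder()
                .appendCodePoint(0x1D158)   // MUSICAL SYMBOL NOTEHEAD BLACK
                .appendCodePoint(0x1D165)   // MUSICAL SYMBOL COMBINING STEM
                .toString();

        System.out.println(note.length());                          // 4 code units
        System.out.println(note.codePointCount(0, note.length()));  // 2 code points
    }
}
```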

woodtluk
  • 935
  • 8
  • 20
  • A glyph can be built from any number of combining characters, google "zalgo" for extreme examples. So the notion that you can store a glyph in 32 bits is already invalid. It is not a problem if you simply only consider storing code units. The wonky stuff is for the text rendering engine to sort out (a sketch of counting code units versus glyphs follows the comments). – Hans Passant Nov 01 '17 at 13:10
  • Thanks! I didn't know that it's possible to build a glyph from more than two combining characters. But it makes sense, even though most possible combinations don't make sense from a linguistics point of view. – woodtluk Nov 01 '17 at 13:43
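To illustrate the point in the comment above, here is a rough sketch: the base letter plus an arbitrary stack of combining marks stands in for a "zalgo"-style glyph, and java.text.BreakIterator's character instance is used as an approximation of user-perceived characters (grapheme clusters):

```java
import java.text.BreakIterator;

public class GraphemeDemo {
    public static void main(String[] args) {
        // One base letter with several stacked combining marks.
        String zalgo = "a\u0308\u0301\u0323\u0331";

        System.out.println(zalgo.length());                            // 5 code units
        System.out.println(zalgo.codePointCount(0, zalgo.length()));   // 5 code points

        // Count user-perceived characters (grapheme clusters).
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(zalgo);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) {
            graphemes++;
        }
        System.out.println(graphemes);  // 1: the whole stack renders as one glyph
    }
}
```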
