I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a variable-length encoding, which is more complex to process than a fixed-length encoding. Also, see my comments on Gumbo's answer: basically, combining characters exist in all encodings (UTF-8, UTF-16 & UTF-32) and they require special handling. You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16, so for the most part you can ignore surrogates and treat UTF-16 just like a fixed encoding.
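To check I understood the combining-characters part, I put together a quick Java sketch (Java seemed natural since its strings are sequences of UTF-16 code units; this snippet is my own illustration, not from the original answers). The same visible "é" can be one code point or two, and that's true no matter which UTF you encode it to:

```java
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        String precomposed = "\u00E9";  // é as a single code point, U+00E9
        String combining   = "e\u0301"; // e + COMBINING ACUTE ACCENT, U+0301

        // Both display as "é", but they differ at the code-point level,
        // so UTF-8, UTF-16 and UTF-32 all carry two units for the second form.
        System.out.println(precomposed.codePointCount(0, precomposed.length())); // 1
        System.out.println(combining.codePointCount(0, combining.length()));     // 2
        System.out.println(precomposed.equals(combining));                       // false

        // Normalization is the kind of "special handling" that's needed
        // in every encoding, fixed-width or not.
        String nfc = Normalizer.normalize(combining, Normalizer.Form.NFC);
        System.out.println(precomposed.equals(nfc));                             // true
    }
}
```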
I'm a little confused by the last part of that comment ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what issues could that cause? What are the chances of encountering characters outside the BMP? And if they do occur, what goes wrong if you've assumed every character is two bytes?
I read the Wikipedia article on surrogate pairs, but it didn't really make things any clearer to me!
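So I tried a little experiment myself in Java (again my own snippet, since Java chars are UTF-16 code units it's a convenient way to poke at this), using U+1D11E MUSICAL SYMBOL G CLEF, which sits outside the BMP:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E is outside the BMP, so UTF-16 encodes it
        // as the surrogate pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point

        // Under the "fixed 16-bit" assumption, charAt(0) is treated as
        // a whole character, but it's really half of one.
        char first = clef.charAt(0);
        System.out.println(Character.isHighSurrogate(first));      // true

        // Cutting on a code-unit boundary splits the pair and leaves
        // an ill-formed string (a lone surrogate).
        String broken = clef.substring(0, 1);
        System.out.println(broken.codePointAt(0) == 0xD834);       // true: half a clef
    }
}
```

So length counts, indexing and slicing all silently go wrong for non-BMP characters, which I suppose is the "special handling" being referred to.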
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as a fixed-width encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16 characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 never has three-byte characters: every code unit is exactly two bytes, so even if you assume 16 bits per character you can't end up one byte off and "totally screw up" the rest of the stream. But I'm still not convinced this is any different from assuming UTF-8 is all single-byte characters!
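To make the comparison I have in mind concrete, here's a rough Java sketch of the two naive assumptions (again just my own illustration): a 16-bit scan of UTF-16 gets every BMP character right, and surrogate units announce themselves, so the scan never falls out of alignment; a one-byte scan of UTF-8 garbles every non-ASCII character, including a plain "é":

```java
import java.nio.charset.StandardCharsets;

public class NaiveScanDemo {
    public static void main(String[] args) {
        String s = "h\u00E9 \uD834\uDD1E"; // "hé " plus U+1D11E (non-BMP)

        // Naive "fixed 16-bit" scan of UTF-16: every BMP character comes
        // out right, and the rare surrogate units are self-identifying.
        for (char unit : s.toCharArray()) {
            if (Character.isSurrogate(unit)) {
                System.out.println("surrogate unit: 0x" + Integer.toHexString(unit));
            } else {
                System.out.println("character: " + unit);
            }
        }

        // Naive "fixed 8-bit" scan of UTF-8: even the ordinary é is
        // misread, because it occupies two bytes (0xC3 0xA9).
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) != 0) {
                System.out.println("fragment byte: 0x" + Integer.toHexString(b & 0xFF));
            } else {
                System.out.println("character: " + (char) b);
            }
        }
    }
}
```

Though I realize UTF-8's lead and continuation bytes are also distinguishable, so maybe the difference is just how often the naive assumption is wrong rather than anything fundamental?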