
We have an application that is written using UTF-8 as its base encoding, and it supports the UTF-8 BMP (up to 3 bytes per code point). However, there is a requirement that it needs to support surrogate pairs.

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

If yes, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

I don't have a code snippet, as the entire application was written with UTF-8 in mind and not surrogate characters.

What would I need to change throughout the code to get either support for surrogate pairs in UTF-8, or to change the default encoding to UTF-16?

Ryota Kaneko
  • Surrogate pairs are a way of encoding code points that are not on the BMP and are too large to store in UTF-16. UTF-8 can simply store these codepoints and I am pretty sure that any tool that saw a UTF-16 surrogate pair encoded (double encoded) in UTF-8 as two codepoints would treat the situation as an error. – Chris Becke Mar 02 '17 at 13:25
  • @ChrisBecke : That's an answer, not a comment. – Martin Bonner supports Monica Mar 02 '17 at 13:27
  • Oh, except that it should be "too large to store in UTC-2", UTF-16 is how you store values up to just over 1,000,000 while using a 16-bit basic block (and surrogate pairs are the answer). – Martin Bonner supports Monica Mar 02 '17 at 13:29
  • Note that UTF-16 also requires the use of surrogate pairs. You will need to use UTF-32 to avoid them. – Richard Critten Mar 02 '17 at 14:42
  • @MartinBonner: "*Oh, except that it should be "too large to store in **UCS-2**",*" – Remy Lebeau Mar 08 '17 at 22:59
  • @RichardCritten: *Only* UTF-16 requires the use of surrogate pairs. Correct UTF-8 and UTF-32 do not use them and should not use them. There are some incorrect tools that make malformed UTF-8 from UTF-16 with surrogate pairs. This is known by the term "CESU-8". – hippietrail Nov 09 '19 at 10:28

2 Answers


We have an application that is written using UTF-8 as its base encoding, and it supports the UTF-8 BMP (up to 3 bytes per code point).

Why not the entire Unicode repertoire (4 bytes)? Why limited to only 3 bytes? 3 bytes gets you support for codepoints only up to U+FFFF. 4 bytes gets you support for an additional 1048576 codepoints, all the way up to U+10FFFF.
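To make the difference concrete, here is a minimal sketch (a hypothetical `encode_utf8` helper, not taken from any particular library) showing how a codepoint above U+FFFF takes the 4-byte UTF-8 form:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a single Unicode codepoint as UTF-8.
// Codepoints up to U+FFFF need at most 3 bytes; U+10000..U+10FFFF need 4.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Surrogate codepoints are reserved for UTF-16 and are not valid here.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate codepoint is not valid in UTF-8");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("codepoint out of Unicode range");
    }
    return out;
}

// Example: U+1F600 (outside the BMP) encodes as the 4 bytes F0 9F 98 80.
```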

However, there is a requirement that it needs to support surrogate pairs.

Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

The codepoints that are used for encoding surrogates can be physically encoded in UTF-8, however they are reserved by the Unicode standard and are illegal to use outside of UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints in it should be considered malformed.
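If an illustration helps, a strict decoder or validator would typically reject those codepoints outright; a minimal sketch of such a check (the function names are made up for illustration):

```cpp
#include <cstdint>

// Surrogate codepoints (U+D800..U+DFFF) are reserved for UTF-16 and must not
// appear in well-formed UTF-8 or in a decoded codepoint sequence.
bool is_surrogate(char32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// A Unicode scalar value is any codepoint up to U+10FFFF that is not a surrogate.
bool is_valid_scalar_value(char32_t cp)
{
    return cp <= 0x10FFFF && !is_surrogate(cp);
}
```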

If yes, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.

However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you support surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many, many more codepoints than that.

It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
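Roughly, the difference looks like this (an illustrative sketch, not your code): UCS-2-only code treats every 16-bit unit as a complete codepoint, while UTF-16-aware code first combines high/low surrogate pairs into a single codepoint:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch: decode UTF-16 code units into Unicode codepoints,
// combining surrogate pairs instead of treating each 16-bit unit as a
// complete character (which is what UCS-2-only code effectively does).
std::vector<char32_t> decode_utf16(const std::u16string& in)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            char16_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // High + low surrogate -> codepoint in U+10000..U+10FFFF.
                out.push_back(0x10000 + ((char32_t(u) - 0xD800) << 10)
                                      + (char32_t(lo) - 0xDC00));
                ++i;
                continue;
            }
        }
        // BMP code unit (an unpaired surrogate here would be an error in a
        // real decoder, typically replaced with U+FFFD).
        out.push_back(u);
    }
    return out;
}
```

Each resulting codepoint can then be fed to the 4-byte-capable UTF-8 encoder shown earlier.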

Remy Lebeau

We have an application that is written using UTF-8 as its base encoding, and it supports the UTF-8 BMP (up to 3 bytes per code point). However, there is a requirement that it needs to support surrogate pairs.

So convert the UTF-16 encoded strings to UTF-8. Documentation here: http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
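For example, a minimal sketch using that facility (`std::wstring_convert` with `std::codecvt_utf8_utf16`; note both are deprecated since C++17, though still widely available):

```cpp
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>    // std::wstring_convert
#include <string>

int main()
{
    // Converter between UTF-16 (char16_t) and UTF-8 (char, in std::string).
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    // Inbound: UTF-16 text (here containing a surrogate pair for U+1F600) -> UTF-8.
    std::u16string utf16 = u"caf\u00E9 \U0001F600";
    std::string utf8 = conv.to_bytes(utf16);

    // Outbound: UTF-8 -> UTF-16 when an external interface requires it.
    std::u16string back = conv.from_bytes(utf8);

    return back == utf16 ? 0 : 1;
}
```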

If yes, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

Wrong question. Use UTF-8 internally.

What would I need to change throughout the code to get either support for surrogate pairs in UTF-8, or to change the default encoding to UTF-16?

See above. Convert UTF-16 to UTF-8 for inbound data and convert back to UTF-16 outbound when necessary.

Richard Hodges
  • For new apps running on Windows or written in Java I would use UTF-16 internally as it's native on those platforms. For *nix, macOS, and cross-platform I would use UTF-8. I'm pretty sure Android must use UTF-16 natively and iOS must use UTF-8 internally. I would always go with the native Unicode format except for cross-platform apps. For existing apps stick with what you have, so for the OP, yes, stick with UTF-8. That might not apply to anyone else dropping by to read this Q&A though. – hippietrail Nov 09 '19 at 10:39