Understanding binary String delimiters

Question

I am confused about the difference between encodings that are represented with \x, like \x68\x65\x6c\x6c\x6f, vs. ones using \u, such as \u0068\u0065\u006c\u006c\u006f.

I've been playing around with https://convertcodes.com/unicode-converter-encode-decode-utf/ and it seems like UTF-16 uses \u and UTF-8 uses \x, but from other sources I've read that \x is not specific to UTF-8 and \u is not specific to UTF-16. What is the difference, can both encodings use both of these delimiters? Furthermore, is the title of this question even correct? Can these be referred to as binary delimiters? Are the example strings (\x68\x65\x6c\x6c\x6f and \u0068\u0065\u006c\u006c\u006f) considered binary Strings, BLOBs, or something else? What is the proper name for these types of Strings?

What language are you talking about that uses these escape sequences in its string representation? But yes, for character codes <= FF they are exchangeable. — Bergi, Jun 20 '21 at 10:41
Read [String literal](https://en.cppreference.com/w/cpp/language/string_literal) and [Escape sequences](https://en.cppreference.com/w/cpp/language/escape) — JosefZ, Jun 20 '21 at 10:42
"*Can both encodings use both of these delimiters?*" - well, UTF-16 obviously needs a way to display 16-bit character codes, which `\x` does not offer. — Bergi, Jun 20 '21 at 10:45

score 2 · Accepted Answer · answered Jun 20 '21 at 11:16

Everything depends entirely on what interprets the text, so the question needs at least a minimum of context:

JSON only knows \u (not bound to a specific UTF encoding) and always wants 4 digits for it, doesn't know \x, and String literals must be enclosed in "double quotation marks".
PHP knows \x (expecting 1 or 2 hex digits) and, since PHP 7.0, \u{...} (braces required, any number of hex digits, always output as UTF-8), but only when using "double quotation marks" for String literals.
MySQL knows neither of those escape sequences, and String literals can either be in 'single quotation marks' or "double quotation marks". One must use hexadecimal literals separately. This is unbound to any encoding used.
C++ knows \x (expecting one or more hex digits, with no fixed upper limit), \u (expecting exactly 4 hex digits) and \U (expecting exactly 8 hex digits), which in conjunction with a String literal's prefix then has a different outcome as per encoding. String literals are always in "double quotation marks", single character literals always in 'single quotation marks'.
Perl regular expressions know \x (expecting up to 2 hex digits, or any number of them inside braces as in \x{263A}) and \N (expecting a codepoint as in \N{U+263A}, or a character name). Different RegEx flavors have different support, some also accept an \x with 4 digits. Mostly \x is bound to the input encoding (in some flavors this is implied by the u modifier for UTF-8).
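
To make the list above concrete, here is a minimal sketch in Python 3, a language the list does not cover and which is used here only for illustration. It assumes Python's own rules, where \x and \u in a text literal both name code points, and it uses the standard json module to show the JSON behaviour described above. All variable and function names are made up for this example.

```python
import json

# In a Python 3 text literal, \x and \u both name code points,
# so these two literals are the same string.
via_x = "\x68\x65\x6c\x6c\x6f"
via_u = "\u0068\u0065\u006c\u006c\u006f"

# JSON accepts only the four-digit \u form inside double quotes.
from_json = json.loads(r'"\u0068\u0065\u006c\u006c\u006f"')


def json_rejects_x_escape() -> bool:
    """Return True if the JSON parser refuses a \\x escape."""
    try:
        json.loads(r'"\x68"')
    except json.JSONDecodeError:
        return True
    return False


# The encoding is chosen only when text is turned into bytes,
# independently of which escape was used to write the literal.
utf8_bytes = via_u.encode("utf-8")
utf16_bytes = via_u.encode("utf-16-be")

# For code points up to FF the two escapes are interchangeable in text...
e_acute_text = "\xe9"
same_text = e_acute_text == "\u00e9"

# ...but in a bytes literal \x names a raw byte, so the UTF-8 form of
# that same character has to be spelled out as two bytes.
e_acute_utf8 = b"\xc3\xa9"

print(via_x == via_u == from_json)
print(json_rejects_x_escape())
print(utf8_bytes, utf16_bytes)
print(same_text, e_acute_text.encode("utf-8") == e_acute_utf8)
```

The escapes are therefore neither "binary delimiters" nor tied to UTF-8 or UTF-16: they are escape sequences inside a string literal, and what they produce is defined by whichever language or format parses that literal.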
