
I am confused about the difference between encodings that are

  • represented with \x, like \x68\x65\x6c\x6c\x6f vs.
  • ones using \u, such as \u0068\u0065\u006c\u006c\u006f.

I've been playing around with https://convertcodes.com/unicode-converter-encode-decode-utf/ and it seems like UTF-16 uses \u and UTF-8 uses \x, but from other sources I've read that \x is not specific to UTF-8 and \u is not specific to UTF-16. What is the difference? Can both encodings use both of these delimiters? Furthermore, is the title of this question even correct? Can these be referred to as binary delimiters? Are the example strings (\x68\x65\x6c\x6c\x6f and \u0068\u0065\u006c\u006c\u006f) considered binary Strings, BLOBs, or something else? What is the proper name for these types of Strings?

AmigoJack
Eric Grossman
  • What language are you talking about that uses these escape sequences in its string representation? But yes, for character codes <= FF they are exchangeable. – Bergi Jun 20 '21 at 10:41
  • Read [String literal](https://en.cppreference.com/w/cpp/language/string_literal) and [Escape sequences](https://en.cppreference.com/w/cpp/language/escape) – JosefZ Jun 20 '21 at 10:42
  • "*Can both encodings use both of these delimiters?*" - well, UTF-16 obviously needs a way to display 16-bit character codes, which `\x` does not offer. – Bergi Jun 20 '21 at 10:45

1 Answer

Everything depends entirely on who interprets it, so a minimum of context is always needed:

  • JSON only knows \u (not bound to a specific UTF encoding) and always wants 4 digits for it, doesn't know \x, and String literals must be enclosed in "double quotation marks".
  • PHP knows \x (expecting 1 or 2 digits) and \u{...} (expecting any number of hexadecimal digits inside the braces, bound to UTF-8 encoding), but only when using "double quotation marks" for String literals.
  • MySQL knows neither of those escape sequences, and String literals can either be in 'single quotation marks' or "double quotation marks". One must use hexadecimal literals separately. This is unbound to any encoding used.
  • C++ knows \x (consuming as many hexadecimal digits as follow, not just 2), \u (expecting 4 digits) and \U (expecting 8 digits), which in conjunction with a String literal's prefix then has a different outcome as per encoding. String literals are always in "double quotation marks", single character literals always in 'single quotation marks'.
  • Perl regular expressions know \x (expecting 2 digits, or more inside braces as in \x{263A}) and \N (expecting a character name or a codepoint, as in \N{U+263A}). Different RegEx flavors have different support; some also accept an \x with 4 digits. Mostly \x is bound to the input encoding (sometimes implied with the u modifier for UTF-8).
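
To make the "it depends on who interprets it" point concrete, here is a minimal sketch in Python, a language not covered in the list above and chosen only for illustration. It assumes Python 3 `str` literals, where both `\x` and `\u` name code points; the variable names are made up for this example. The encoding only comes into play when the text is turned into bytes.

```python
import json

# Two spellings of the same five code points in Python source code.
via_x = "\x68\x65\x6c\x6c\x6f"
via_u = "\u0068\u0065\u006c\u006c\u006f"
assert via_x == via_u == "hello"

# The encoding is chosen later, when the text is turned into bytes.
utf8_bytes = via_u.encode("utf-8")       # b'hello'
utf16_bytes = via_x.encode("utf-16-le")  # b'h\x00e\x00l\x00l\x00o\x00'

# JSON accepts \u with exactly four hex digits, but rejects \x.
from_json = json.loads(r'"\u0068\u0065\u006c\u006c\u006f"')

try:
    json.loads(r'"\x68\x65\x6c\x6c\x6f"')
    json_accepts_x = True
except json.JSONDecodeError:
    json_accepts_x = False

print(utf8_bytes, utf16_bytes, from_json, json_accepts_x)
```

Both escape styles yield the same text here, and either spelling can be encoded as UTF-8 or UTF-16. Other interpreters draw the line differently, as the list shows.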

See also: what is a String literal.

AmigoJack
  • for utf-8 is \u0068 even valid, or does 4 digits imply utf-16? – Eric Grossman Jun 20 '21 at 16:02
  • Yes, since [`\x00` to `\x7f` in UTF-8 equal ASCII](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout). No: if 4 digits implied UTF-16, nobody could insert a codepoint in the [range of U+0080 to U+07FF in, e.g., UTF-8](https://en.wikipedia.org/wiki/UTF-8#Encoding). – AmigoJack Jun 20 '21 at 16:15
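
The last comment can be checked with a short sketch, again assuming Python 3 purely for illustration (the names are hypothetical). A four-digit `\u` escape names a code point, and the byte count depends on the encoding applied afterwards, not on the number of digits in the escape.

```python
e_acute = "\u00e9"  # U+00E9, inside the U+0080 to U+07FF range

# In a Python str, \xe9 names the same code point as \u00e9.
assert "\xe9" == e_acute

utf8 = e_acute.encode("utf-8")       # two bytes: b'\xc3\xa9'
utf16 = e_acute.encode("utf-16-be")  # two bytes: b'\x00\xe9'

# ASCII code points stay a single byte in UTF-8, even when written as \u0068.
ascii_utf8 = "\u0068".encode("utf-8")  # b'h'

# In a bytes literal, \xe9 is one raw byte, which is not valid UTF-8 by itself.
try:
    b"\xe9".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(utf8, utf16, ascii_utf8, lone_byte_is_valid)
```

Note the difference between the `str` literal and the `bytes` literal: the same `\xe9` spelling means a code point in one and a raw byte in the other, which is the kind of context dependence the answer describes.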