
I stumbled across some weird behaviour when encoding and decoding a string. Have a look at this example:

@Test
public void testEncoding() {
    String str = "\uDD71"; // {56689}
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16); // {-2, -1, -1, -3}
    String utf16String = new String(utf16, StandardCharsets.UTF_16); // {65533}
    assertEquals(str, utf16String);
}

I would assume this test would pass, but it does not. Could someone explain why the encoded and then decoded string is not equal to the original one?


1 Answer


U+DD71 is a surrogate codepoint: the range U+D800..U+DFFF is reserved by Unicode for the UTF-16 surrogate mechanism, so these codepoints may only appear in pairs and should never appear on their own as character data. Because an unpaired surrogate cannot be encoded, String.getBytes substitutes the replacement character U+FFFD for it (the trailing bytes -1, -3 in your output, after the FE FF byte order mark), and decoding those bytes gives you back U+FFFD, which is the 65533 you see. From the Unicode standard:

Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.
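As a side note, String.getBytes does this substitution silently. If you would rather have the encoder fail loudly, you can build one yourself and tell it to report malformed input instead of replacing it. A minimal sketch (a throwaway main method rather than a test, and the class name is just illustrative):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReportMalformed { // illustrative name
    public static void main(String[] args) {
        try {
            StandardCharsets.UTF_16.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)     // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap("\uDD71"));
        } catch (CharacterCodingException e) {
            // The unpaired surrogate surfaces here as a MalformedInputException
            System.out.println("Cannot encode: " + e);
        }
    }
}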

Encoding a valid codepoint round-trips fine, though:

@Test
public void testEncoding() {
    String str = "\u0040";
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String);
}

So it's not your code that's at fault; you're trying to encode a codepoint that isn't valid on its own.
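For completeness, the same \uDD71 code unit does round-trip when it is part of a proper surrogate pair. Here is a small sketch (again a plain main method with an illustrative class name) pairing it with the high surrogate \uD83D to form the supplementary codepoint U+1F571:

import java.nio.charset.StandardCharsets;

public class SurrogatePairRoundTrip { // illustrative name
    public static void main(String[] args) {
        // \uD83D\uDD71 is a valid surrogate pair for the single codepoint U+1F571
        String str = "\uD83D\uDD71";
        byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
        String roundTripped = new String(utf16, StandardCharsets.UTF_16);
        System.out.println(str.equals(roundTripped)); // prints true
    }
}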

  • I have accepted your answer, but a big thank you goes to @Johannes Kuhn, who was first and helped me understand the problem. – Dawid Wysakowicz May 13 '18 at 20:11
  • Yes, I saw his comment after I finished my answer. He knows more about the subject than I do, but this is what a Google search told me. – SeverityOne May 13 '18 at 20:26