
I stumbled across some weird behaviour when encoding and decoding a string. Have a look at this example:

@Test
public void testEncoding() {
    String str = "\uDD71"; // {56689}
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16); // {-2, -1, -1, -3}
    String utf16String = new String(utf16, StandardCharsets.UTF_16); // {65533}
    assertEquals(str, utf16String);
}

I would assume this test would pass, but it does not. Could someone explain why the encoded and then decoded string is not equal to the original one?


1 Answer


U+DD71 is a surrogate codepoint: the range U+D800..U+DFFF is reserved by Unicode for the UTF-16 surrogate mechanism, so these codepoints may only appear in pairs and should never appear on their own as character data. Because an unpaired surrogate cannot be encoded, String.getBytes substitutes the replacement character U+FFFD for it (the trailing bytes -1, -3 in your output, after the FE FF byte order mark), and decoding those bytes gives you back U+FFFD, which is the 65533 you see. From the Unicode standard:

Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range.
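As a side note, String.getBytes does this substitution silently. If you would rather have the encoder fail loudly, you can build one yourself and tell it to report malformed input instead of replacing it. A minimal sketch (a throwaway main method rather than a test, and the class name is just illustrative):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReportMalformed { // illustrative name
    public static void main(String[] args) {
        try {
            StandardCharsets.UTF_16.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)     // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap("\uDD71"));
        } catch (CharacterCodingException e) {
            // The unpaired surrogate surfaces here as a MalformedInputException
            System.out.println("Cannot encode: " + e);
        }
    }
}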

Encoding a valid codepoint round-trips fine, though:

@Test
public void testEncoding() {
    String str = "\u0040";
    byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
    String utf16String = new String(utf16, StandardCharsets.UTF_16);
    assertEquals(str, utf16String);
}

So it's not your code that's at fault; you're trying to encode a codepoint that isn't valid on its own.
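For completeness, the same \uDD71 code unit does round-trip when it is part of a proper surrogate pair. Here is a small sketch (again a plain main method with an illustrative class name) pairing it with the high surrogate \uD83D to form the supplementary codepoint U+1F571:

import java.nio.charset.StandardCharsets;

public class SurrogatePairRoundTrip { // illustrative name
    public static void main(String[] args) {
        // \uD83D\uDD71 is a valid surrogate pair for the single codepoint U+1F571
        String str = "\uD83D\uDD71";
        byte[] utf16 = str.getBytes(StandardCharsets.UTF_16);
        String roundTripped = new String(utf16, StandardCharsets.UTF_16);
        System.out.println(str.equals(roundTripped)); // prints true
    }
}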

  • I have accepted your answer, but a big thank you goes to @Johannes Kuhn, who was first and helped me understand the problem. – Dawid Wysakowicz May 13 '18 at 20:11
  • Yes, I saw his comment after I finished my answer. He knows more about the subject than I do, but this is what a Google search told me. – SeverityOne May 13 '18 at 20:26