I'm a little confused.

// default charset is UTF-8
val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
val str = String(bytes)
val weird = str.toByteArray()
// weird is now [78, 23, 41, 51, -17, -65, -67, 42]

I put arbitrary values into the `bytes` array on purpose. Why doesn't converting to a string and back give me the original bytes?

Dean
  • `[-17, -65, -67]` array (hexadecimal `0xEF,0xBF,0xBD`) is Byte Order Mark (UTF-8) (appears as `ï¿½` in latin1). – JosefZ Jan 11 '21 at 13:18
  • @JosefZ No, that's `0xEF,0xBB,0xBF`. – Alexey Romanov Jan 11 '21 at 18:09
  • My bad. In fact, the `[-17, -65, -67]` byte array (hexadecimal `0xEF,0xBF,0xBD`), which appears as `ï¿½` in latin1, is � `U+FFFD` *Replacement Character*. Thanks @AlexeyRomanov: *Byte Order Mark* is different: `U+FEFF` (hex `0xEF,0xBB,0xBF`, latin1 `ï»¿`) *Zero Width No-Break Space*. – JosefZ Jan 11 '21 at 20:28

1 Answer

The issue here is that your bytes aren't a valid UTF-8 sequence.

Any sequence of bytes can be interpreted as valid ISO Latin-1, for example.  (Bytes with values 0–31 are control characters and may cause issues, but those generally don't stop the characters being stored and processed.)  The same applies to most other 8-bit character sets.
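As a rough illustration of that point, the sketch below decodes bytes as ISO Latin-1 and encodes them again. The helper name `latin1RoundTrip` is made up for this example; it relies on the JVM mapping every byte value to exactly one Latin-1 character, so nothing is lost.

```kotlin
// Hypothetical helper: decode as ISO Latin-1, then encode back to bytes.
fun latin1RoundTrip(bytes: ByteArray): ByteArray =
    String(bytes, Charsets.ISO_8859_1).toByteArray(Charsets.ISO_8859_1)

fun main() {
    val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
    // Every byte maps to one character and back, so the arrays match.
    println(latin1RoundTrip(bytes).contentEquals(bytes)) // true
}
```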

But the same isn't true of UTF-8.  While all sequences of bytes in the range 0–127 are valid UTF-8 (and interpreted the same as they are in ASCII and most 8-bit encodings), bytes in the range 128–255 can only appear in certain well-defined combinations: a lead byte followed by the right number of continuation bytes.  (This has several very useful properties: it lets you identify UTF-8 with a very high probability; it also avoids issues with synchronisation, searching, sorting, etc.)
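If you would rather detect invalid input than have it silently replaced, one possible approach on the JVM is a strict `CharsetDecoder`. This is only a sketch: `isValidUtf8` is a name chosen for this example, not a standard library function.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction

// Hypothetical helper: true if the bytes decode as UTF-8 without errors.
fun isValidUtf8(bytes: ByteArray): Boolean =
    try {
        Charsets.UTF_8.newDecoder()
            // Throw on malformed input instead of substituting U+FFFD.
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
        true
    } catch (e: CharacterCodingException) {
        false
    }

fun main() {
    println(isValidUtf8(byteArrayOf(78, 23, 41, 51, 42)))      // true: all bytes are ASCII
    println(isValidUtf8(byteArrayOf(-61, -87)))                // true: C3 A9 is a valid two-byte sequence
    println(isValidUtf8(byteArrayOf(78, 23, 41, 51, -32, 42))) // false: the bytes from the question
}
```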

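In this case, the sequence in the question (which is 4E 17 29 33 E0 2A in unsigned hex) isn't valid UTF-8: E0 is the lead byte of a three-byte sequence and must be followed by two continuation bytes (in the range 80–BF), but here it is followed by 2A.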
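To check the hex values yourself, a small helper along these lines converts Kotlin's signed bytes to unsigned hex. The name `toUnsignedHex` is invented for this sketch.

```kotlin
// Hypothetical helper: render each signed byte as two unsigned hex digits.
fun toUnsignedHex(bytes: ByteArray): String =
    bytes.joinToString(" ") { b -> "%02X".format(b.toInt() and 0xFF) }

fun main() {
    val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)
    println(toUnsignedHex(bytes)) // 4E 17 29 33 E0 2A
}
```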

So when you try to convert it to a string using the default encoding (UTF-8), the JVM substitutes the replacement character (U+FFFD, which looks like this: �) in place of each invalid sequence.

Then, when you convert that back to UTF-8, you get the UTF-8 encoding of the replacement character, which is EF BF BD.  And if you interpret that as signed bytes, you get -17 -65 -67, as in the question.
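The whole round trip can be reproduced with a short sketch like the one below. It names UTF-8 explicitly rather than relying on the platform default, and `utf8RoundTrip` is just an illustrative name.

```kotlin
// Hypothetical helper: decode as UTF-8 (invalid input becomes U+FFFD), then encode again.
fun utf8RoundTrip(bytes: ByteArray): ByteArray =
    String(bytes, Charsets.UTF_8).toByteArray(Charsets.UTF_8)

fun main() {
    val bytes = byteArrayOf(78, 23, 41, 51, -32, 42)

    // The lone E0 byte was replaced by U+FFFD during decoding.
    println(String(bytes, Charsets.UTF_8).contains('\uFFFD')) // true

    // U+FFFD encodes to EF BF BD, which is -17, -65, -67 as signed bytes.
    println("\uFFFD".toByteArray(Charsets.UTF_8).toList()) // [-17, -65, -67]

    // So the round trip yields the array from the question.
    println(utf8RoundTrip(bytes).toList()) // [78, 23, 41, 51, -17, -65, -67, 42]
}
```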

So Kotlin/JVM is handling the invalid input as best it can.

gidds