This applies only to the writeUTF method of DataOutputStream, not to normal converted streams (OutputStreamWriter or such).
It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.
And the other way around: the sequence 0xC0 0x80, which will never occur in normal UTF-8 strings, represents a NUL character.
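To see the difference for yourself, here is a minimal sketch that writes "\u0000" once with writeUTF and once with a normal UTF-8 conversion. The class and helper names (NulEncodingDemo, writeUtfBytes, hex) are made up for this illustration. Note that writeUTF also puts a 2-byte length prefix in front of the encoded data, so that shows up in the output as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class NulEncodingDemo {

    // Returns everything writeUTF emits: 2-byte length prefix plus modified UTF-8 payload.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Modified UTF-8 via writeUTF: prints "00 02 C0 80"
        System.out.println(hex(writeUtfBytes("\u0000")));
        // Real UTF-8: prints "00"
        System.out.println(hex("\u0000".getBytes(StandardCharsets.UTF_8)));
    }
}
```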
Also, the documentation you linked seems to be from the time when Unicode was still a 16-bit character set. Nowadays it also allows characters above 0xFFFF, which are represented by two Java char values each (a surrogate pair, in UTF-16 format) and need 4 bytes in real UTF-8. I'm not sure about the implementation here, though: it looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to one UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
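A quick way to check the surrogate handling is to push one character above 0xFFFF through both encoders and compare. The sketch below uses U+1F600 as an arbitrary example; the class and method names are invented for the illustration. It strips the 2-byte length prefix so that only the encoded character is compared.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class SurrogateEncodingDemo {

    // Returns the writeUTF payload for s, without the 2-byte length prefix.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // One code point above 0xFFFF, stored as two Java chars (a surrogate pair).
        String s = new String(Character.toChars(0x1F600));

        byte[] modified = writeUtfPayload(s);
        byte[] real = s.getBytes(StandardCharsets.UTF_8);

        // writeUTF encodes each surrogate separately: 3 + 3 bytes.
        System.out.println("writeUTF payload: " + modified.length + " bytes");
        // Real UTF-8 encodes the code point as a single 4-byte sequence.
        System.out.println("real UTF-8:       " + real.length + " bytes");
    }
}
```

For this input the writeUTF payload is ED A0 BD ED B8 80 (the two surrogates 0xD83D and 0xDE00, three bytes each), while real UTF-8 gives F0 9F 98 80.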
If you are using Java, the simplest thing would be to use DataInputStream to read this into a string, and then convert it (with getBytes("UTF-8") or an OutputStreamWriter) to real UTF-8 data.
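A minimal sketch of that conversion, assuming the input is exactly what a single writeUTF call produced (2-byte length prefix followed by the modified UTF-8 data). The class and method names are made up for the example, and I use StandardCharsets.UTF_8 instead of the "UTF-8" string to avoid the checked UnsupportedEncodingException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class ModifiedUtf8Converter {

    // Decodes one writeUTF record and re-encodes the resulting string as real UTF-8.
    static byte[] toRealUtf8(byte[] writeUtfData) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(writeUtfData))) {
            // readUTF understands the length prefix, the C0 80 form of NUL,
            // and surrogates written as separate 3-byte sequences.
            String s = in.readUTF();
            return s.getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Length prefix 0x0002, then the modified UTF-8 encoding of NUL.
        byte[] modified = {0x00, 0x02, (byte) 0xC0, (byte) 0x80};
        byte[] real = toRealUtf8(modified);
        // Prints "1 byte(s), value 0"
        System.out.println(real.length + " byte(s), value " + real[0]);
    }
}
```

Going through a String this way handles both the NUL case and the surrogate pairs, because the encoder used by getBytes combines each pair into one 4-byte sequence.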