Java UTF-8 differences

Question

The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."

But what does this even mean? What's an embedded null in this context? I am trying to convert a UTF-8 string saved by Java to "real" UTF-8.

Use `readUTF` on the datastream to get a (real) unicode string. — hakre, Jun 22 '11 at 12:31
@hakre, thanks. But I don't think I can change the behavior of that program, I just have to deal with what it gives me. — Prof. Falken, Jun 22 '11 at 12:33
Is the problem you're having in the data written by a `DataOutputStream`, or in the data read by a `DataInputStream`? — Matt Ball, Jun 22 '11 at 12:36
It's written by a DataOutputStream. Then I tried to read this in a (supposedly) UTF-8 aware C program, which I am hacking on. @Matt Ball — Prof. Falken, Jun 22 '11 at 12:38
@Amigable Clark Kant: You should probably note that you need to load it from C in your question ;) — hakre, Jun 22 '11 at 13:37
@hakre, maybe, but the question was about what Java's format is. That I read it from C is incidental. — Prof. Falken, Jun 22 '11 at 13:55
Java generates invalid UTF-8 for such things. A strict interpretation must guard against that. It’s a problem. — tchrist, Jun 23 '11 at 15:58

Thorbjørn Ravn Andersen · Accepted Answer · 2011-06-22T12:37:56.513

16

In C a string is terminated by the byte value 00.

Java strings can contain 0-chars, but to avoid confusion when a string is passed over to C (which all native methods are written in), the character is encoded in another way, namely as the two bytes

11000000 10000000

(according to the javadoc) neither of which is actually 00.

This is a hack to work around something you cannot change easily.

Also note that this is valid UTF-8 and decodes correctly to 00.
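To see the two encodings side by side, here is a minimal sketch that writes a string containing a 0-char with `DataOutputStream.writeUTF` and compares the result with `String.getBytes` in standard UTF-8. The class name `NulEncodingDemo` and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class NulEncodingDemo {
    // Bytes produced by writeUTF: a 2-byte length prefix followed by the payload.
    public static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u0000b";
        System.out.println("writeUTF: " + hex(writeUtfBytes(s)));
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For `"a\u0000b"` the `writeUTF` form should come out as `00 04 61 C0 80 62` (length prefix, then payload with no 00 byte in it), while the standard form is `61 00 62`.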

edited Jun 22 '11 at 12:37

answered Jun 22 '11 at 12:27

Thorbjørn Ravn Andersen


Thank you! This does indeed answer my question. Unfortunately, it did not give me any insights into why my program fails, but I will keep looking. :-) – Prof. Falken Jun 22 '11 at 12:37
I'm not sure about *this is valid UTF-8* - a naive decoder will decode it to 0, but [RFC 3629](http://tools.ietf.org/html/rfc3629#page-5) says clearly: *Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000 [...].* – Paŭlo Ebermann Jun 22 '11 at 12:50
@Paŭlo Ebermann, hm? What do you mean? What does overlong mean? – Prof. Falken Jun 22 '11 at 12:56
@Amigable: To be clearer: It was legal UTF-8 before version 3.1. Since 3.1, each character must be encoded in the shortest form possible. This is also mentioned in the definition of UTF-8 on page 93f in the current version 6.0 of the Unicode standard (http://www.unicode.org/versions/Unicode6.0.0/, chapter *Conformance*). (UTF-8 sequences which map to surrogates are also invalid). – Paŭlo Ebermann Jun 22 '11 at 13:12
There are other languages that use this, too. I know Tcl does and I would assume there's more. – RHSeeger Jun 22 '11 at 13:21

score 4 · Answer 2 · edited Jun 22 '11 at 12:45


No "embedded nulls" means that the raw data does not contain a single 0x00 (NUL) byte.

\u0000 gets encoded to (binary) 11000000 10000000, (hex) 0xC0 0x80.

edited Jun 22 '11 at 12:45

Prof. Falken


answered Jun 22 '11 at 12:28

Mat


Matt Ball · Answer 3 · 2011-06-22T12:38:38.140

1

That's not a Java-wide difference; it applies only to DataInputStream/DataOutputStream. If the string data was written using DataOutputStream, then just read it in using DataInputStream.

If you need to write the string data to, say, a file, don't use DataOutputStream, use a Writer, which is meant for character streams.
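As a rough sketch of the Writer approach, the following wraps an output stream in an `OutputStreamWriter` with an explicit UTF-8 charset. The class and method names are hypothetical, and a `ByteArrayOutputStream` stands in for the file so the example is self-contained; a `FileOutputStream` could be substituted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterDemo {
    // Encodes the string as standard UTF-8 through a Writer.
    // Unlike writeUTF, there is no length prefix and U+0000 becomes a single 00 byte.
    public static byte[] writeStandardUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeStandardUtf8("a\u0000b");
        System.out.println(bytes.length + " bytes written");
    }
}
```

Note that data written this way does contain embedded 00 bytes when the string has 0-chars, so a C reader cannot treat it as a NUL-terminated string.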

edited Jun 22 '11 at 12:38

answered Jun 22 '11 at 12:33

Matt Ball


Thanks, but I can't change the Java program at this time, only deal with the output it creates. – Prof. Falken Jun 22 '11 at 12:39
Indeed. I may be able to change that later though, if I get a chance to update all the other clients. – Prof. Falken Jun 22 '11 at 12:41

Paŭlo Ebermann · Answer 4 · 2011-06-22T12:50:47.967

1

This is only for the method writeUTF of DataOutputStream, not for normal converted streams (OutputStreamWriter or such).

It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.

Conversely, the sequence 0xC0 0x80, which will never occur in normal UTF-8 strings, represents a NUL character.

Also, the documentation you linked seems to be from the time when Unicode was still a 16-bit character set - nowadays it also allows characters over 0xFFFF, which will be represented by two Java char values each (in UTF-16 format, a surrogate pair), and will need 4 bytes in UTF-8, if I calculated right. I'm not sure about the implementation here, though - it looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to a UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
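A small sketch can make the surrogate handling visible. It encodes U+1F600 (chosen arbitrarily as a character above 0xFFFF) with `writeUTF` and with standard UTF-8; the class name `SurrogateEncodingDemo` and its helpers are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SurrogateEncodingDemo {
    // Bytes produced by writeUTF, including the 2-byte length prefix.
    public static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point above 0xFFFF, stored as two Java chars (a surrogate pair).
        String s = new String(Character.toChars(0x1F600));
        System.out.println("chars:    " + s.length());
        System.out.println("writeUTF: " + hex(writeUtfBytes(s)));
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Here `writeUTF` should give `00 06 ED A0 BD ED B8 80` (each surrogate as its own 3-byte sequence), whereas standard UTF-8 gives the single 4-byte sequence `F0 9F 98 80`.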

If you are using Java, the simplest thing would be to use DataInputStream to read this into a string, and then convert it (with getBytes("UTF-8") or an OutputStreamWriter) to real UTF-8 data.
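A minimal sketch of that conversion, assuming the input is exactly one record as written by `writeUTF` (length prefix included). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedToStandard {
    // Reads one writeUTF record and re-encodes the resulting string as standard UTF-8.
    public static byte[] toStandardUtf8(byte[] writeUtfRecord) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(writeUtfRecord))) {
            String s = in.readUTF();                     // handles C0 80 and surrogate pairs
            return s.getBytes(StandardCharsets.UTF_8);   // shortest-form UTF-8
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = {0x00, 0x02, (byte) 0xC0, (byte) 0x80};
        byte[] utf8 = toStandardUtf8(record);
        System.out.println(utf8.length + " byte(s), first = " + utf8[0]);
    }
}
```

The result contains real 00 bytes wherever the string had 0-chars, so whatever consumes it needs to track the length separately.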

edited Jun 22 '11 at 12:50

answered Jun 22 '11 at 12:40

Paŭlo Ebermann


Thanks, interesting. And also thanks for introducing CESU-8, didn't know about that acronym. – Prof. Falken Jun 22 '11 at 12:43
Also, for our purposes we never go outside the 16-bit character set. – Prof. Falken Jun 22 '11 at 12:45
Did you mean "represented by Java char values"? If so, I never understood how, because chars are only 16 bit right? – Prof. Falken Jun 22 '11 at 12:47
It should have been *represented by **two** Java char values*. Thanks for noting. – Paŭlo Ebermann Jun 22 '11 at 12:51
The JavaDocs that the OP linked were from Java 1.4.2. Here are the latest: http://download.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8 – Matt Ball Jun 22 '11 at 12:54
+1 Well, if you took your time writing it, I can read it right? :-) – Prof. Falken Jun 22 '11 at 12:57

score 0 · Answer 5 · answered Jun 22 '11 at 13:24

If you are having difficulty reading a "saved" Java string, you need to look at the specification for the methods that read/write in that format:

If the string was written using DataOutput.writeUTF(), the DataInput.readUTF() javadoc is a definitive spec. In addition to the non-standard handling of NUL, it specifies that the string starts with an unsigned 16-bit byte count.
If the string was written using ObjectOutputStream.writeObject() then the serialization spec is definitive.
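For the `writeUTF` case, the layout described in that javadoc can be sketched as a hand-rolled decoder, which is roughly the logic a non-Java reader would need to port. This is an illustration under the assumption that the input holds one complete record; `ModifiedUtf8Parser`, `decode` and `continuation` are made-up names, and the error handling is deliberately simple.

```java
public class ModifiedUtf8Parser {
    // Decodes one record: an unsigned big-endian 16-bit byte count, then that many payload bytes.
    public static String decode(byte[] record) {
        if (record.length < 2) {
            throw new IllegalArgumentException("missing length prefix");
        }
        int len = ((record[0] & 0xFF) << 8) | (record[1] & 0xFF);
        int end = 2 + len;
        if (record.length < end) {
            throw new IllegalArgumentException("truncated payload");
        }
        StringBuilder sb = new StringBuilder(len);
        int i = 2;
        while (i < end) {
            int b0 = record[i] & 0xFF;
            if (b0 < 0x80) {
                // 1-byte form: U+0001..U+007F
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {
                // 2-byte form: U+0080..U+07FF, plus C0 80 for U+0000
                int c1 = continuation(record, i + 1, end);
                sb.append((char) (((b0 & 0x1F) << 6) | c1));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                // 3-byte form: U+0800..U+FFFF, including each half of a surrogate pair
                int c1 = continuation(record, i + 1, end);
                int c2 = continuation(record, i + 2, end);
                sb.append((char) (((b0 & 0x0F) << 12) | (c1 << 6) | c2));
                i += 3;
            } else {
                throw new IllegalArgumentException("unexpected lead byte at offset " + i);
            }
        }
        return sb.toString();
    }

    // Returns the 6 payload bits of a continuation byte (10xxxxxx), or fails.
    private static int continuation(byte[] data, int index, int end) {
        if (index >= end || (data[index] & 0xC0) != 0x80) {
            throw new IllegalArgumentException("malformed sequence at offset " + index);
        }
        return data[index] & 0x3F;
    }

    public static void main(String[] args) {
        byte[] record = {0x00, 0x04, 0x61, (byte) 0xC0, (byte) 0x80, 0x62};
        String s = decode(record);
        System.out.println(s.length() + " chars, middle char = " + (int) s.charAt(1));
    }
}
```

Because the decoder produces UTF-16 chars, surrogate halves pair up naturally in the resulting string; a port that emits UTF-8 directly would instead have to combine the two 3-byte sequences into one 4-byte sequence.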
