
Why does Java use modified UTF-8 rather than standard UTF-8 for object serialization and JNI?

One possible explanation is that modified UTF-8 output never contains an embedded null byte, even when the string itself contains U+0000, so functions that operate on null-terminated C strings can be used with it. Are there any other reasons?
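
For illustration, here is a small sketch (the class and the toHex helper are mine, purely hypothetical names) comparing the bytes produced by standard UTF-8 and by DataOutputStream.writeUTF, which writes modified UTF-8:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class NullByteDemo {
        public static void main(String[] args) throws IOException {
            String s = "a\u0000b";

            // Standard UTF-8: U+0000 becomes the single byte 0x00, which a
            // C routine such as strlen() would treat as the terminator.
            System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // 61 00 62

            // Modified UTF-8 (writeUTF): U+0000 becomes the two-byte
            // sequence C0 80, so the encoded string data has no zero byte.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            // 00 04 61 C0 80 62 -- the first two bytes are writeUTF's length prefix
            System.out.println(toHex(buf.toByteArray()));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            return sb.toString().trim();
        }
    }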

vitaut
  • I could ask you why you are trying to read serialized Java objects not in Java :-) – radai Mar 15 '13 at 19:39
  • @radai: I am not reading anything, just asking a question. =) – vitaut Mar 15 '13 at 19:41
  • In that case I think NPE is right. Looks like they use it whenever they need to interact with C (serialization, JNI, class file parsing). – radai Mar 15 '13 at 19:50
  • This decision was made by an employee of Sun a very, very, very long time ago. Probably that person knows the answer, and no one else does. All you are going to get here is speculation. – bmargulies Mar 15 '13 at 21:06
  • And he'll take his secret to the grave! – ZhongYu Mar 15 '13 at 21:11

3 Answers


It is faster and simpler for handling supplementary characters (by not handling them).

Java represents characters as 16-bit chars, but Unicode has grown to contain more than 64K code points. So some characters, the supplementary characters, have to be encoded as 2 chars (a surrogate pair) in Java.

Strict UTF-8 requires the encoder to combine each surrogate pair into a code point and then encode that code point into bytes. The decoder has to split supplementary code points back into surrogate pairs.

chars -> code point -> bytes -> code point -> chars

Since both ends are Java, we can take a shortcut and encode directly at the char level:

chars -> bytes -> chars

Neither the encoder nor the decoder needs to worry about surrogate pairs.
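
To make the shortcut concrete, here is a small sketch (class and helper names are mine, just for illustration) showing that writeUTF encodes each char of a surrogate pair on its own, while a standard UTF-8 encoder first combines the pair into one code point:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) throws IOException {
            // U+1F600 is a supplementary character: one code point, two chars.
            String s = new String(Character.toChars(0x1F600));
            System.out.println(s.length());  // 2 (a surrogate pair: D83D DE00)

            // Standard UTF-8 encodes the single code point as 4 bytes.
            System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // F0 9F 98 80

            // Modified UTF-8 (writeUTF) encodes each surrogate char separately
            // as a 3-byte sequence and never looks past the char level.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            // 00 06 ED A0 BD ED B8 80 -- length prefix, then 3 bytes per surrogate
            System.out.println(toHex(buf.toByteArray()));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            return sb.toString().trim();
        }
    }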

ZhongYu
  • A takeaway from this is to never use "modified UTF-8" (e.g. from DataOutputStream) for external storage that is not intended to be read back in from Java. – robinst May 20 '15 at 06:43

I suspect that's the main reason. In C land, having to deal with strings that can contain embedded NULs would complicate things.

NPE

There is a good description of Modified UTF-8 in Unicode Explained - Page 306, but it does not explain why Modified UTF-8 was decided on.

There is also a very detailed explanation in Java's own documentation of how support for non-BMP Unicode characters was originally added to Java: Supplementary Characters in the Java Platform. But again, no explanation as to why Modified UTF-8 was decided on.

I don't think you are going to find a why unless you ask Java's architects directly.

Remy Lebeau