
Why does Java use modified UTF-8 rather than standard UTF-8 for object serialization and JNI?

One possible explanation is that modified UTF-8 output never contains an embedded null byte, even when the string itself contains U+0000, so functions that operate on null-terminated C strings can be used with it. Are there any other reasons?
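
For illustration, here is a small sketch (the class and the toHex helper are mine, purely hypothetical names) comparing the bytes produced by standard UTF-8 and by DataOutputStream.writeUTF, which writes modified UTF-8:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class NullByteDemo {
        public static void main(String[] args) throws IOException {
            String s = "a\u0000b";

            // Standard UTF-8: U+0000 becomes the single byte 0x00, which a
            // C routine such as strlen() would treat as the terminator.
            System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // 61 00 62

            // Modified UTF-8 (writeUTF): U+0000 becomes the two-byte
            // sequence C0 80, so the encoded string data has no zero byte.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            // 00 04 61 C0 80 62 -- the first two bytes are writeUTF's length prefix
            System.out.println(toHex(buf.toByteArray()));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            return sb.toString().trim();
        }
    }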

vitaut
  • I could ask you why you are trying to read serialized Java objects not in Java :-) – radai Mar 15 '13 at 19:39
  • @radai: I am not reading anything, just asking a question. =) – vitaut Mar 15 '13 at 19:41
  • In that case I think NPE is right. Looks like they use it whenever they need to interact with C (serialization, JNI, class file parsing). – radai Mar 15 '13 at 19:50
  • This decision was made by an employee of Sun a very, very, very long time ago. Probably that person knows the answer, and no one else does. All you are going to get here is speculation. – bmargulies Mar 15 '13 at 21:06
  • And he'll take his secret to the grave! – ZhongYu Mar 15 '13 at 21:11

3 Answers


It is faster and simpler for handling supplementary characters (by not handling them).

Java represents characters as 16-bit chars, but Unicode has grown to contain more than 64K code points. So some characters, the supplementary characters, have to be encoded as 2 chars (a surrogate pair) in Java.

Strict UTF-8 requires the encoder to combine each surrogate pair into a code point and then encode that code point into bytes. The decoder has to split supplementary code points back into surrogate pairs.

chars -> code point -> bytes -> code point -> chars

Since both ends are Java, we can take a shortcut and encode directly at the char level:

chars -> bytes -> chars

Neither the encoder nor the decoder needs to worry about surrogate pairs.
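
To make the shortcut concrete, here is a small sketch (class and helper names are mine, just for illustration) showing that writeUTF encodes each char of a surrogate pair on its own, while a standard UTF-8 encoder first combines the pair into one code point:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) throws IOException {
            // U+1F600 is a supplementary character: one code point, two chars.
            String s = new String(Character.toChars(0x1F600));
            System.out.println(s.length());  // 2 (a surrogate pair: D83D DE00)

            // Standard UTF-8 encodes the single code point as 4 bytes.
            System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // F0 9F 98 80

            // Modified UTF-8 (writeUTF) encodes each surrogate char separately
            // as a 3-byte sequence and never looks past the char level.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            // 00 06 ED A0 BD ED B8 80 -- length prefix, then 3 bytes per surrogate
            System.out.println(toHex(buf.toByteArray()));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02X ", b));
            return sb.toString().trim();
        }
    }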

ZhongYu
  • A takeaway from this is to never use "modified UTF-8" (e.g. from DataOutputStream) for external storage that is not intended to be read back in from Java. – robinst May 20 '15 at 06:43

I suspect that's the main reason. In C land, having to deal with strings that can contain embedded NULs would complicate things.

NPE

There is a good description of Modified UTF-8 in Unicode Explained - Page 306, but it does not explain why Modified UTF-8 was decided on.

There is also a very detailed explanation in Java's own documentation of how support for non-BMP Unicode characters was originally added to Java: Supplementary Characters in the Java Platform. But again, no explanation as to why Modified UTF-8 was decided on.

I don't think you are going to find a why unless you ask Java's architects directly.

Remy Lebeau