
I read that Java uses UTF-16 encoding internally, i.e. I understand that if I have something like String var = "जनमत"; then "जनमत" will be encoded in UTF-16 internally. So, if I dump this variable to a file as below:

FileOutputStream fileOut = new FileOutputStream("output.xyz");
ObjectOutputStream out = new ObjectOutputStream(fileOut);
out.writeObject(var);

will the encoding of the string "जनमत" in the file "output.xyz" be in UTF-16? Also, later on if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?

Thanks.

Buhake Sindi
Bikash Gyawali
  • I don't think you should care about encoding used by `ObjectOutputStream`. If you are going to use the generated file somewhere else, just don't use `ObjectOutputStream`. If not, you don't have to think about it. – khachik Dec 08 '10 at 17:37
  • You really shouldn't put non-ASCII characters directly in a *.java* source file; this has been debated here and there ad nauseam. Basically, *.java* files have no metadata associated with them saying which encoding they use, nor is there any specification mandating a specific encoding. Hence the sh!t **SHALL** hit the fan sooner or later when you mix OSes, IDEs, text editors, tools (batch/shell scripts), etc. You should always either externalize your non-ASCII characters to other files (whose encoding you fully control) or use the *\uXXXX* Java escaping. – SyntaxT3rr0r Dec 08 '10 at 17:43
  • to answer your question, no, the fact that Java may or may not use UTF-16 or UCS-2 (or the colors of the moonboots little fearing are wearing) to store strings internally has no effect at all on the encoding used when saving said string to a file. – SyntaxT3rr0r Dec 08 '10 at 17:45

3 Answers


So, If I dump this variable to some file... will the encoding of the string "जनमत" in the file "output.xyz" be in UTF-16?

The encoding of your string in the file will be in whatever format the ObjectOutputStream wants to put it in. You should treat it as a black box that can only be read by an ObjectInputStream. (Seriously - even though the format is IIRC well-documented, if you want to read it with some other tool, you should serialise the object yourself as XML or JSON or whatever.)

Later on if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?

If you read the file with an ObjectInputStream, you'll get a copy of the original object back. This will include a java.lang.String, which is just a sequence of characters (not bytes), from which you could get the UTF-16 representation if you wished via getBytes("UTF-16") (though I suspect you don't actually need to). Note that getBytes() with no argument uses the platform default charset, which is not necessarily UTF-16.
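As a minimal sketch of that round trip (the class name RoundTripDemo and the helper roundTrip are made up for illustration, and an in-memory buffer stands in for "output.xyz"), the string is written with writeObject, read back with readObject, and only then converted to UTF-16 bytes explicitly. UTF_16BE is used here so that no byte-order mark is prepended.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
public class RoundTripDemo {

    // Serializes a String into a byte array and deserializes it again.
    public static String roundTrip(String value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (String) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // The word from the question, written with \\uXXXX escapes so the
        // source file encoding does not matter.
        String original = "\u091c\u0928\u092e\u0924";
        String copy = roundTrip(original);

        // The String survives regardless of the on-disk encoding.
        System.out.println(copy.equals(original)); // true

        // Ask for UTF-16 explicitly if the bytes are really needed.
        byte[] utf16 = copy.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf16.length); // 8 (4 chars, 2 bytes each)
    }
}
```

The point of the sketch is that the encoding inside the serialized data never shows up in the calling code; an encoding is chosen only when getBytes is called.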


In conclusion, don't worry too much about the internal details of serialization. If you need to know what's going on, create the file yourself; and if you're just curious, trust in the JVM to do the right thing.

Andrzej Doyle

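Close: a Java String is a sequence of 16-bit char values, i.e. UTF-16 code units (versions before Java 5 were effectively UCS-2). Either way it uses 2 bytes for most characters, and a sequence of 2 chars, i.e. 4 bytes, for the rarely used code points outside the Basic Multilingual Plane.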

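ObjectOutputStream writes strings in something called modified UTF-8, which is like UTF-8 except that the zero character (U+0000) is written as the 2-byte sequence 0xC0 0x80. That sequence is not legal in standard UTF-8 (which forbids overlong encodings), but it naturally decodes back to the value 0. Supplementary characters are also written as two separately encoded surrogates rather than as a single 4-byte sequence.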
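To see the difference concretely, here is a small sketch that compares DataOutputStream.writeUTF (which emits a 2-byte length followed by modified UTF-8) with standard UTF-8 from getBytes. The class ModifiedUtf8Demo and its helpers modifiedUtf8 and hex are hypothetical names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for comparing the two encodings.
public class ModifiedUtf8Demo {

    // Encodes with writeUTF and drops the leading 2-byte length field,
    // leaving only the modified UTF-8 payload.
    public static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+0000: two bytes in modified UTF-8, one byte in standard UTF-8.
        System.out.println(hex(modifiedUtf8("\u0000")));                    // C0 80
        System.out.println(hex("\u0000".getBytes(StandardCharsets.UTF_8))); // 00

        // An ordinary BMP character is encoded identically in both.
        System.out.println(hex(modifiedUtf8("\u091c")));                    // E0 A4 9C
        System.out.println(hex("\u091c".getBytes(StandardCharsets.UTF_8))); // E0 A4 9C
    }
}
```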

But what you are really asking is "does it work so that I write a String and read a String back?", and the answer to that is yes. The JDK does the proper encoding when writing bytes out, and the proper decoding when reading them back.

For what it's worth, you may be better off using the writeUTF() method for Strings, since I think the resulting output is a bit more compact. writeObject() also works; it just needs a bit more metadata.

StaxMan

Just to add to this: ObjectOutputStream.writeString() (a private method) determines the UTF length of a given string and writes it in either "standard" UTF or "long" UTF format, where "long" is described in the Javadoc as follows:

"Long" UTF format is identical to standard UTF, except that it uses an 8 byte header (instead of the standard 2 bytes) to convey the UTF encoding length.

I got this from the JDK source code:

private void writeString(String str, boolean unshared) throws IOException {
    handles.assign(unshared ? null : str);
    long utflen = bout.getUTFLength(str);
    if (utflen <= 0xFFFF) {
        bout.writeByte(TC_STRING);
        bout.writeUTF(str, utflen);
    } else {
        bout.writeByte(TC_LONGSTRING);
        bout.writeLongUTF(str, utflen);
    }
}

and in writeObject(Object obj) they do a check

if (obj instanceof String) {
    writeString((String) obj, unshared);
}
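The effect of that branch can be observed from the outside. The following sketch (the class StreamLayoutDemo and its serialize helper are hypothetical, and the byte offsets assume a lone String written to a fresh stream) serializes the word from the question and picks apart the leading bytes: the 4-byte stream header, the TC_STRING tag, and the 2-byte length of the modified UTF-8 payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;

// Hypothetical demo class for inspecting the serialized form of a String.
public class StreamLayoutDemo {

    // Serializes a single String with writeObject into a byte array.
    public static byte[] serialize(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(s);
        }
        return buffer.toByteArray();
    }

    // Reads the 2-byte big-endian length that follows the TC_STRING tag.
    public static int utfLength(byte[] data) {
        return ((data[5] & 0xFF) << 8) | (data[6] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize("\u091c\u0928\u092e\u0924");

        // Bytes 0..3 are the stream header (magic AC ED, version 00 05).
        // Byte 4 is the type tag chosen by writeString.
        System.out.println(data[4] == ObjectStreamConstants.TC_STRING); // true

        // Bytes 5..6 are the encoded length: 4 characters, 3 bytes each.
        System.out.println(utfLength(data)); // 12

        // Header (4) + tag (1) + length (2) + payload (12).
        System.out.println(data.length); // 19
    }
}
```

A string whose encoded length exceeds 0xFFFF bytes would take the other branch and carry the TC_LONGSTRING tag with an 8-byte length instead.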
Buhake Sindi