Inquiry about method readUTF() of class DataInputStream

Question

Does anybody know how it works under the hood? I have read the API documentation, but it's not that clear to me. Could anybody explain it in simpler terms? Thanks in advance.

What part of the Javadoc you cited didn't you understand? That's the specification: it's important that you be able to understand documents like this. — user207421, Jul 31 '13 at 09:53

score 1 · Answer 1 · answered Jul 31 '13 at 09:50

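First, an unsigned short (2 bytes, big-endian) is read. This is the length of the encoded string in bytes, not in characters.
Then that many bytes are read and decoded, one character at a time, until all of them have been consumed:
Look at the leading byte. If it matches the bit pattern 0xxxxxxx, the character is 1 byte long. If it matches 110xxxxx, the character is encoded in 2 bytes; if it matches 1110xxxx, in 3 bytes. Every following byte of a multi-byte character must match 10xxxxxx. The payload bits (the x's) are joined into a single 16-bit char, which is appended to the string to be returned. Any other leading byte (10xxxxxx or 1111xxxx) results in a UTFDataFormatException. The Javadoc calls this encoding "modified UTF-8".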
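To see the format on the wire, here is a minimal round-trip sketch. The class and helper names (`ReadUtfRoundTrip`, `encode`, `decode`, `hex`) are made up for illustration; only `writeUTF` and `readUTF` are the real API. It writes a three-character string whose characters need 1, 2 and 3 bytes, then dumps the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ReadUtfRoundTrip {
    // Encodes s with writeUTF and returns the raw bytes (length prefix + payload).
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes produced by writeUTF back into a String using readUTF.
    static String decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
        byte[] bytes = encode("a\u00E9\u20AC");
        System.out.println(hex(bytes));    // 00 06 61 C3 A9 E2 82 AC
        System.out.println(decode(bytes).length());    // 3
    }
}
```

The prefix `00 06` is the byte count (6), while the decoded string has only 3 characters, which is why the length prefix cannot be treated as a character count.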

Seeing the code behind the function may help:

public final static String readUTF(DataInput in) throws IOException {
    int utflen = in.readUnsignedShort();
    byte[] bytearr = null;
    char[] chararr = null;
    if (in instanceof DataInputStream) {
        DataInputStream dis = (DataInputStream)in;
        if (dis.bytearr.length < utflen){
            dis.bytearr = new byte[utflen*2];
            dis.chararr = new char[utflen*2];
        }
        chararr = dis.chararr;
        bytearr = dis.bytearr;
    } else {
        bytearr = new byte[utflen];
        chararr = new char[utflen];
    }

    int c, char2, char3;
    int count = 0;
    int chararr_count=0;

    in.readFully(bytearr, 0, utflen);

    while (count < utflen) {
        c = (int) bytearr[count] & 0xff;
        if (c > 127) break;
        count++;
        chararr[chararr_count++]=(char)c;
    }

    while (count < utflen) {
        c = (int) bytearr[count] & 0xff;
        switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
                /* 0xxxxxxx*/
                count++;
                chararr[chararr_count++]=(char)c;
                break;
            case 12: case 13:
                /* 110x xxxx   10xx xxxx*/
                count += 2;
                if (count > utflen)
                    throw new UTFDataFormatException(
                        "malformed input: partial character at end");
                char2 = (int) bytearr[count-1];
                if ((char2 & 0xC0) != 0x80)
                    throw new UTFDataFormatException(
                        "malformed input around byte " + count);
                chararr[chararr_count++]=(char)(((c & 0x1F) << 6) |
                                                (char2 & 0x3F));
                break;
            case 14:
                /* 1110 xxxx  10xx xxxx  10xx xxxx */
                count += 3;
                if (count > utflen)
                    throw new UTFDataFormatException(
                        "malformed input: partial character at end");
                char2 = (int) bytearr[count-2];
                char3 = (int) bytearr[count-1];
                if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                    throw new UTFDataFormatException(
                        "malformed input around byte " + (count-1));
                chararr[chararr_count++]=(char)(((c     & 0x0F) << 12) |
                                                ((char2 & 0x3F) << 6)  |
                                                ((char3 & 0x3F) << 0));
                break;
            default:
                /* 10xx xxxx,  1111 xxxx */
                throw new UTFDataFormatException(
                    "malformed input around byte " + count);
        }
    }
    // The number of chars produced may be less than utflen
    return new String(chararr, 0, chararr_count);
}
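The bit masking in the 2-byte and 3-byte cases can be tried in isolation. This is only a sketch: `Utf8Assemble`, `fromTwoBytes` and `fromThreeBytes` are hypothetical names, and the expressions are copied from the listing above with no validation of the continuation bytes.

```java
public class Utf8Assemble {
    // Combines a 2-byte sequence 110xxxxx 10xxxxxx into one char:
    // 5 payload bits from the first byte, 6 from the second.
    static char fromTwoBytes(int b1, int b2) {
        return (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
    }

    // Combines a 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx into one char:
    // 4 payload bits from the first byte, 6 from each of the next two.
    static char fromThreeBytes(int b1, int b2, int b3) {
        return (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
    }

    public static void main(String[] args) {
        // C3 A9: (0x03 << 6) | 0x29 = 0xE9
        System.out.printf("U+%04X%n", (int) fromTwoBytes(0xC3, 0xA9));          // U+00E9
        // E2 82 AC: (0x2 << 12) | (0x02 << 6) | 0x2C = 0x20AC
        System.out.printf("U+%04X%n", (int) fromThreeBytes(0xE2, 0x82, 0xAC)); // U+20AC
    }
}
```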

'2 bytes (Unicode)' is not an accurate reduction of what it says in the Javadoc. — user207421, Jul 31 '13 at 09:56
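One way to see why the byte format is not plain Unicode or standard UTF-8 is to compare encoded lengths. The sketch below uses hypothetical helper names (`ModifiedUtf8Demo`, `modifiedLength`, `standardLength`, `rejects`); it relies on the Javadoc's statement that the format is modified UTF-8, where the null character takes 2 bytes and a supplementary character is written as two 3-byte surrogates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Payload size written by writeUTF, excluding the 2-byte length prefix.
    static int modifiedLength(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s);
            }
            return buf.size() - 2;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size of the same string in standard UTF-8.
    static int standardLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Returns true if readUTF refuses the payload as malformed.
    static boolean rejects(byte[] payload) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeShort(payload.length); // length prefix expected by readUTF
            out.write(payload);
            out.flush();
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())).readUTF();
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String nul = "\0";
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(modifiedLength(nul) + " vs " + standardLength(nul));     // 2 vs 1
        System.out.println(modifiedLength(emoji) + " vs " + standardLength(emoji)); // 6 vs 4
        // The standard 4-byte UTF-8 form of U+1F600 starts with 1111xxxx.
        byte[] fourByteForm = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        System.out.println(rejects(fourByteForm)); // true
    }
}
```

So bytes produced by `String.getBytes(StandardCharsets.UTF_8)` are not guaranteed to be readable with `readUTF`, even with a correct length prefix.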
