
Does anybody know how DataInputStream.readUTF works under the hood? I have read the API documentation, but it is not that clear to me. Could anybody explain it in simpler terms? Thanks in advance.

Rollerball
  • What part of the Javadoc you cited didn't you understand? That's the specification: it's important that you be able to understand documents like this. – user207421 Jul 31 '13 at 09:53

1 Answer

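  1. First, an unsigned short (two bytes) is read. It gives the number of bytes of encoded data that follow, which is not necessarily the number of characters in the string.
  2. That many bytes are read, and the following step is repeated until all of them have been consumed:
  3. Look at the next byte. If it matches the bit pattern 0xxxxxxx, it is a complete one-byte character. If it matches 110xxxxx, the character occupies 2 bytes; if it matches 1110xxxx, it occupies 3 bytes. Every following byte of a multi-byte character must match 10xxxxxx, and its low 6 bits are combined with the low bits of the first byte. Any other pattern causes a UTFDataFormatException. Each assembled character is appended to the end of the string to be returned.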
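To see the length prefix and the variable-width encoding in practice, here is a minimal sketch that writes a string with DataOutputStream.writeUTF and inspects the resulting bytes. The class name UtfLayoutDemo and its helper methods are made up for illustration; only the java.io calls are part of the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class, not part of the JDK.
public class UtfLayoutDemo {

    // Encode a string with writeUTF and return the raw bytes.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Rebuild the two-byte, big-endian length prefix by hand.
    static int lengthPrefix(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    // Decode the bytes again with readUTF.
    static String decode(byte[] encoded) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(encoded))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // 'a' takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes.
        String text = "a\u00e9\u20ac";
        byte[] encoded = encode(text);

        System.out.println("chars in string: " + text.length());         // 3
        System.out.println("length prefix:   " + lengthPrefix(encoded)); // 6
        System.out.println("total bytes:     " + encoded.length);        // 8
        System.out.println("round trip ok:   " + text.equals(decode(encoded)));
    }
}
```

The prefix is 6 even though the string has only 3 characters, which shows that the unsigned short counts encoded bytes rather than characters.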

Seeing the code behind the function may help (this is readUTF from the JDK's DataInputStream source):

public final static String readUTF(DataInput in) throws IOException {
int utflen = in.readUnsignedShort();
byte[] bytearr = null;
char[] chararr = null;
if (in instanceof DataInputStream) {
    DataInputStream dis = (DataInputStream)in;
    if (dis.bytearr.length < utflen){
        dis.bytearr = new byte[utflen*2];
        dis.chararr = new char[utflen*2];
    }
    chararr = dis.chararr;
    bytearr = dis.bytearr;
} else {
    bytearr = new byte[utflen];
    chararr = new char[utflen];
}

int c, char2, char3;
int count = 0;
int chararr_count=0;

in.readFully(bytearr, 0, utflen);

while (count < utflen) {
    c = (int) bytearr[count] & 0xff;
    if (c > 127) break;
    count++;
    chararr[chararr_count++]=(char)c;
}

while (count < utflen) {
    c = (int) bytearr[count] & 0xff;
    switch (c >> 4) {
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            /* 0xxxxxxx*/
            count++;
            chararr[chararr_count++]=(char)c;
            break;
        case 12: case 13:
            /* 110x xxxx   10xx xxxx*/
            count += 2;
            if (count > utflen)
                throw new UTFDataFormatException(
                    "malformed input: partial character at end");
            char2 = (int) bytearr[count-1];
            if ((char2 & 0xC0) != 0x80)
                throw new UTFDataFormatException(
                    "malformed input around byte " + count);
            chararr[chararr_count++]=(char)(((c & 0x1F) << 6) |
                                            (char2 & 0x3F));
            break;
        case 14:
            /* 1110 xxxx  10xx xxxx  10xx xxxx */
            count += 3;
            if (count > utflen)
                throw new UTFDataFormatException(
                    "malformed input: partial character at end");
            char2 = (int) bytearr[count-2];
            char3 = (int) bytearr[count-1];
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80))
                throw new UTFDataFormatException(
                    "malformed input around byte " + (count-1));
            chararr[chararr_count++]=(char)(((c     & 0x0F) << 12) |
                                            ((char2 & 0x3F) << 6)  |
                                            ((char3 & 0x3F) << 0));
            break;
        default:
            /* 10xx xxxx,  1111 xxxx */
            throw new UTFDataFormatException(
                "malformed input around byte " + count);
    }
}
// The number of chars produced may be less than utflen
return new String(chararr, 0, chararr_count);
}
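The bit manipulation in the two multi-byte branches above can be tried in isolation. The following is a small sketch, not JDK code: the class UtfBitsDemo and its methods are hypothetical, but the masks and shifts are the same ones used in the case 12/13 and case 14 branches.

```java
// Hypothetical demo class, not part of the JDK.
public class UtfBitsDemo {

    // 110xxxxx 10xxxxxx: 5 bits from the first byte, 6 from the second.
    static char decodeTwo(int b1, int b2) {
        return (char) (((b1 & 0x1F) << 6) | (b2 & 0x3F));
    }

    // 1110xxxx 10xxxxxx 10xxxxxx: 4 bits, then 6, then 6.
    static char decodeThree(int b1, int b2, int b3) {
        return (char) (((b1 & 0x0F) << 12)
                     | ((b2 & 0x3F) << 6)
                     |  (b3 & 0x3F));
    }

    public static void main(String[] args) {
        // The high nibble of the first byte selects the switch branch.
        System.out.println(0xC3 >> 4); // 12, the two-byte branch
        System.out.println(0xE2 >> 4); // 14, the three-byte branch

        // C3 A9 encodes U+00E9; E2 82 AC encodes U+20AC.
        System.out.println(Integer.toHexString(decodeTwo(0xC3, 0xA9)));         // e9
        System.out.println(Integer.toHexString(decodeThree(0xE2, 0x82, 0xAC))); // 20ac
    }
}
```

Each continuation byte contributes its low 6 bits, so a two-byte sequence carries 11 bits of character data and a three-byte sequence carries 16, which is enough for any Java char.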

David Xu