Reading a single UTF-8 character with RandomAccessFile

Question

I've set up a sequential scanner, where a RandomAccessFile pointing to my file is able to read a single character, via the below method:

public char nextChar() {
    try {
        seekPointer++;
        int i = source.read();
        return i > -1 ? (char) i : '\0'; // INFO: EOF character is -1.
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}

The seekPointer is just a reference for my program. The method stores the result of source.read() in an int and returns it cast to a char unless the end of the file has been reached. But the chars I'm receiving are limited to ASCII; in fact, it's so bad that I can't even use a symbol such as ç.

Is there a way to receive a single character in UTF-8, or at least in some standardised encoding that allows more than just the ASCII character set?

I know I can use readUTF() but that returns an entire line as a String, which is not what I am after.

Also, I can't simply use another stream reader, because my program requires a seek(int) function, allowing me to move back and forth in the file.

@TamasHegedus Updated the question. I require a seek function. — , Feb 17 '17 at 23:31
As @WillisBlackburn points out in his detailed answer below, you cannot select a random byte offset in a UTF-8 file and be guaranteed to get a "character". You might have to back up to find the start of a multi-byte sequence. Is this what you had in mind? — Jim Garrison, Feb 18 '17 at 00:00
@JimGarrison Well, I'm trying to make an algorithm out of his answer but it's not going very well. So no, not what I had in mind; something more along the lines of Adam's answer. I'm just seeing what works at the moment. — , Feb 18 '17 at 00:03
You'll have to define "works" a little better. Assuming a UTF-8 encoded file that may contain multi-byte sequences, what do you expect to occur? If you can clarify what you want to happen in all situations we might be able to help you. — Jim Garrison, Feb 18 '17 at 00:05
@JimGarrison Right, well I believe I can pick up on the sequence with some byte manipulation, and measure the amount of bytes to read, I am capable of reading all of the required bytes, and then I convert the bytes using `new String(char[])`, and what I am left with, is only the second byte of the character, converted to a string. :/ — , Feb 18 '17 at 00:07
That's because you need to use the `String(byte[] bytes, Charset c)` constructor and specify UTF-8. Otherwise it will assume your platform default character set. — Jim Garrison, Feb 18 '17 at 00:08
@JimGarrison Well, after some attempts, looks like I've got it. It works! — , Feb 18 '17 at 00:24

score 2 · Answer 1 · 2017-02-18T01:22:44.500

Building on Willis Blackburn's answer, I can compare the integer returned by read() against a few thresholds to work out how many bytes the character occupies, and therefore how many bytes I need to read ahead.

Judging by the following table:

leading bits of the byte     integer value        meaning
0                            <= 127               1-byte char
10                           >= 128 && <= 191     continuation byte (middle of a char)
110                          >= 192 && <= 223     first byte of a 2-byte char
1110                         >= 224 && <= 239     first byte of a 3-byte char
11110                        >= 240 && <= 247     first byte of a 4-byte char

We can check the integer read from RandomAccessFile.read() by comparing it against the numbers in the middle column, which are literally just the integer representations of a byte. This allows us to skip byte conversion completely, saving time.

The following code reads one character, 1 to 4 bytes long, from a RandomAccessFile:

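import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UTFDataFormatException;
import java.nio.charset.Charset;

int seekPointer = 0;
RandomAccessFile source; // initialise in your own way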

public void seek(int shift) {
    seekPointer += shift;
    if (seekPointer < 0) seekPointer = 0;
    try {
        source.seek(seekPointer);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private int byteCheck(int chr) {
    if (chr == -1) return 1; // EOF
    int i = 1; // there's always at least one byte
    if (chr >= 192) i++; // 2 bytes
    if (chr >= 224) i++; // 3 bytes
    if (chr >= 240) i++; // 4 bytes
    if (chr >= 128 && chr <= 191) i = -1; // whoops, we're halfway through a char!
    return i;
}

public char nextChar() {
    try {
        seekPointer++;
        int i = source.read();

        if (byteCheck(i) == -1) {
            boolean malformed = true;
            for (int k = 0; k < 4; k++) {
                // The first seek(-1) lands back on the byte we just read, so these
                // 4 iterations look at that byte and at most the 3 bytes before it.
                // A UTF-8 char is at most 4 bytes long; going back any further
                // would run into the previous char.
                seek(-1);
                i = source.read();
                if (byteCheck(i) != -1) {
                    malformed = false;
                    break;
                }
            }
            if (malformed) {
                seek(3);
                // UTFDataFormatException is an IOException, so the catch below
                // logs it and the method returns '\0'.
                throw new UTFDataFormatException("Malformed UTF char at position: " + seekPointer);
            }
            seekPointer++; // the read() above consumed the lead byte; keep seekPointer in step
        }

        byte[] chrs = new byte[byteCheck(i)];
        chrs[0] = (byte) i;

        for (int j = 1; j < chrs.length; j++) {
            seekPointer++;
            chrs[j] = (byte) source.read();
        }

        // Note: for a 4-byte (non-BMP) character, charAt(0) is only the high surrogate.
        return i > -1 ? new String(chrs, Charset.forName("UTF-8")).charAt(0) : '\0'; // read() returns -1 at EOF.
    } catch (IOException e) {
        e.printStackTrace();
    }
    return '\0';
}
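
A side note on the 4-byte case: the method above returns charAt(0), and a Java char holds only 16 bits. The following is a minimal sketch, not part of the original answer, of what that means for a character outside the Basic Multilingual Plane. The class name FourByteCharDemo and its decode helper are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; FourByteCharDemo and decode are hypothetical names.
class FourByteCharDemo {
    // Decodes a complete UTF-8 byte sequence into a String.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 encoded as 4 UTF-8 bytes.
        byte[] bytes = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        String s = decode(bytes);

        System.out.println(s.length());                             // 2 (a surrogate pair)
        System.out.println(Integer.toHexString(s.charAt(0)));       // d83d (high surrogate only)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600 (the full code point)
    }
}
```

If the scanner needs to support such characters, returning an int code point via codePointAt(0) instead of a char is one option.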

This is probably about right. You should decide what you want to do if the byte starts with 10 (in other words, it is between 128 and 191). In that case you're looking at a byte in the middle of a character and should either back up or read forward until you find a starting byte. — Willis Blackburn, Feb 18 '17 at 00:39
@WillisBlackburn Well, the way I designed my program, I won't actually need it, but it's going to be a good learning experience so I'll go do that now! — , Feb 18 '17 at 00:40
@WillisBlackburn Already have. You got a few downvotes. I'll accept your answer too though, because without it I would be stuck. Thank you very much. — , Feb 18 '17 at 00:41
Appreciate it. It's all about the points. :-) I wish the down voters would comment to explain why. — Willis Blackburn, Feb 18 '17 at 00:42
I enjoyed answering your question because UTF-8 is such an elegant character-encoding solution and it's fun to explain how it works. It can read ASCII directly, it's as efficient as ASCII for encoding characters in the ASCII set, and the reader can distinguish initial from subsequent bytes in multibyte characters. Supposedly Ken Thompson designed it on a placemat at a diner in New Jersey. — Willis Blackburn, Feb 18 '17 at 00:49

Willis Blackburn · Accepted Answer · 2017-02-18T00:50:52.273

I'm not entirely sure what you're trying to do, but let me give you some information that might help.

The UTF-8 encoding represents characters as either 1, 2, 3, or 4 bytes depending on the Unicode value of the character.

For characters 0x00-0x7F, UTF-8 encodes the character as a single byte. This is a very useful property because if you're only dealing with 7-bit ASCII characters, the UTF-8 and ASCII encodings are identical.
For characters 0x80-0x7FF, UTF-8 uses 2 bytes: the first byte is binary 110 followed by the 5 high bits of the character, while the second byte is binary 10 followed by the 6 low bits of the character.
The 3- and 4-byte encodings are similar to the 2-byte encoding, except that the first byte of the 3-byte encoding starts with 1110 and the first byte of the 4-byte encoding starts with 11110.
See Wikipedia for all the details.
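
As a worked example of the 2-byte layout, here is a small sketch that encodes ç (U+00E7) by hand and compares the result with Java's built-in encoder. The class name TwoByteEncoding and the method encodeTwoByte are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch; TwoByteEncoding and encodeTwoByte are hypothetical names.
class TwoByteEncoding {
    // Encodes a code point in the range 0x80-0x7FF as two UTF-8 bytes.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a 2-byte code point");
        }
        int first = 0xC0 | (codePoint >> 6);    // binary 110 + 5 high bits
        int second = 0x80 | (codePoint & 0x3F); // binary 10 + 6 low bits
        return new byte[] { (byte) first, (byte) second };
    }

    public static void main(String[] args) {
        byte[] manual = encodeTwoByte(0xE7); // U+00E7 is 'ç'
        byte[] builtin = "\u00E7".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", manual[0] & 0xFF, manual[1] & 0xFF); // C3 A7
        System.out.println(Arrays.equals(manual, builtin));                   // true
    }
}
```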

Now this may seem pretty byzantine but the upshot of it is this: you can read any byte in a UTF-8 file and know whether you're looking at a standalone character, the first byte of a multibyte character, or one of the other bytes of a multibyte character.

If the byte you read starts with binary 0, you're looking at a single-byte character. If it starts with 110, 1110, or 11110, then you have the first byte of a multibyte character of 2, 3, or 4 bytes, respectively. If it starts with 10, then it's one of the subsequent bytes of a multibyte character; scan backwards to find the start of it.
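
The classification in the previous paragraph can be sketched with bit masks as follows. The names Utf8ByteKind and sequenceLength are hypothetical, and the method assumes its argument is an unsigned byte value (0 to 255) such as the one returned by RandomAccessFile.read().

```java
// Illustrative sketch; Utf8ByteKind and sequenceLength are hypothetical names.
class Utf8ByteKind {
    // Returns 1-4 for the first byte of a character of that length,
    // 0 for a continuation byte, and -1 for a byte that cannot start
    // or continue a UTF-8 sequence (0xF8-0xFF).
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: single-byte character
        if ((b & 0xC0) == 0x80) return 0; // 10xxxxxx: continuation byte
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength('A'));  // 1
        System.out.println(sequenceLength(0xC3)); // 2
        System.out.println(sequenceLength(0xA7)); // 0
        System.out.println(sequenceLength(0xE2)); // 3
    }
}
```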

So if you want to let your caller seek to any random position in a file and read the UTF-8 character there, you can just apply the algorithm above to find the first byte of that character (if it's not the one at the specified position) and then read and decode the value.
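
A rough sketch of that idea, assuming the file is well-formed UTF-8: given any byte offset, back up over continuation bytes to the lead byte, then read and decode the whole character. SeekAndRead and its codePointAt method are names made up for this example, and the validation is deliberately minimal.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Illustrative sketch; SeekAndRead and codePointAt are hypothetical names.
class SeekAndRead {
    // Returns the code point of the character covering byte offset pos, or -1 at EOF.
    static int codePointAt(RandomAccessFile file, long pos) throws IOException {
        long start = pos;
        // Back up over at most 3 continuation bytes (binary 10xxxxxx).
        for (int i = 0; i < 3 && start > 0; i++) {
            file.seek(start);
            int b = file.read();
            if (b == -1 || (b & 0xC0) != 0x80) break;
            start--;
        }
        file.seek(start);
        int lead = file.read();
        if (lead == -1) return -1;
        int length = lead < 0x80 ? 1 : lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 0;
        if (length == 0) throw new IOException("no lead byte found near offset " + pos);
        byte[] bytes = new byte[length];
        bytes[0] = (byte) lead;
        file.readFully(bytes, 1, length - 1); // throws EOFException on a truncated character
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("utf8-demo", ".txt");
        // Bytes: 61 | C3 A7 | E2 82 AC
        Files.write(tmp.toPath(), "a\u00E7\u20AC".getBytes(StandardCharsets.UTF_8));
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            System.out.println(Integer.toHexString(codePointAt(raf, 2))); // e7
            System.out.println(Integer.toHexString(codePointAt(raf, 5))); // 20ac
        } finally {
            tmp.delete();
        }
    }
}
```

After the call, the file pointer sits just past the decoded character, so repeated calls with getFilePointer() as the offset walk forward one character at a time.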

See the Java Charset class for a method to decode UTF-8 from the source bytes. There may be easier ways but Charset will work.
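
For instance, once the bytes of one character have been collected, Charset.decode can turn them into chars. This is only a sketch; CharsetDecodeExample and decode are names chosen for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; CharsetDecodeExample and decode are hypothetical names.
class CharsetDecodeExample {
    // Decodes UTF-8 bytes; malformed input is replaced with U+FFFD rather than rejected.
    static String decode(byte[] utf8) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] cedilla = { (byte) 0xC3, (byte) 0xA7 }; // the two bytes of 'ç'
        String s = decode(cedilla);
        System.out.println(s.length());                       // 1
        System.out.println(Integer.toHexString(s.charAt(0))); // e7
    }
}
```

The constructor new String(bytes, StandardCharsets.UTF_8) gives the same result in one call.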

Update: The code below should handle the 1- and 2-byte UTF-8 cases. Not tested at all, YMMV.

for (;;) {
    int b = source.read();
    // End of file: read() returns -1.
    if (b == -1)
        return '\0';
    // Single byte character starting with binary 0.
    if ((b & 0x80) == 0)
        return (char) b;
    // 2-byte character starting with binary 110.
    if ((b & 0xE0) == 0xC0)
        return (char) ((b & 0x1F) << 6 | source.read() & 0x3F);
    // 2nd, 3rd, or 4th byte of a multibyte char starting with 10.
    // Back up and loop.
    if ((b & 0xC0) == 0x80) {
        source.seek(source.getFilePointer() - 2);
        continue;
    }
    // 3 and 4 byte encodings left as an exercise...
    // Fail instead of looping forever on their lead bytes.
    throw new UnsupportedOperationException("3- and 4-byte sequences not handled");
}

I wouldn't bother with seekPointer. The RandomAccessFile knows its own position; just call getFilePointer when you need it.

Can you please give me an example? I'm trying to make an "algorithm" with the byte checks but it's not going anywhere... — , Feb 18 '17 at 00:04
Well it looks like I've succeeded in creating an algorithm, I'll just do some checks and see if it works completely. — , Feb 18 '17 at 00:25
Yeah, the `seekPointer` is for other things I'm using; I only included it because I use it in the method. I use it to seek between characters and lines of a file, so I can reference *where* the characters actually are in the file line/position wise. — , Feb 18 '17 at 00:57
You state that "if it starts with 10, it's a subsequent byte". So in int terms, "if it is >= 128, it's a subsequent byte". **But**, can it start with 110, or 1110, or 1111 like the first byte can? — , Feb 18 '17 at 01:00
You're right--I'm imagining logic that looks like "if >= 240 then do the 4-byte thing, else if >= 224 do the 3-byte thing, else if >= 192 do the 2-byte thing, else if >= 128 then it's a middle byte, else it's a single-byte character." — Willis Blackburn, Feb 18 '17 at 01:04
Implemented! Now it will check back 3 bytes and if it fails to find a starting position, it will realign itself at the failed byte, and throw a `UTFDataFormatException`. — , Feb 18 '17 at 01:17

marco · Answer 3 · 2017-02-18T01:30:54.673

From the case statement in java.io.DataInputStream.readUTF(DataInput) you can derive something like

public static char readUtf8Char(final DataInput dataInput) throws IOException {
    int char1, char2, char3;

    char1 = dataInput.readByte() & 0xff;
    switch (char1 >> 4) {
        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
            /* 0xxxxxxx*/
            return (char)char1;
        case 12: case 13:
            /* 110x xxxx   10xx xxxx*/
            char2 = dataInput.readByte() & 0xff;
            if ((char2 & 0xC0) != 0x80) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x1F) << 6) | (char2 & 0x3F));
        case 14:
            /* 1110 xxxx  10xx xxxx  10xx xxxx */
            char2 = dataInput.readByte() & 0xff;
            char3 = dataInput.readByte() & 0xff;
            if (((char2 & 0xC0) != 0x80) || ((char3 & 0xC0) != 0x80)) {
                throw new UTFDataFormatException("malformed input");
            }
            return (char)(((char1 & 0x0F) << 12) | ((char2 & 0x3F) << 6) | ((char3 & 0x3F) << 0));
        default:
            /* 10xx xxxx,  1111 xxxx */
            throw new UTFDataFormatException("malformed input");
    }
}

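Note that RandomAccessFile implements DataInput, so you can pass it to the above method. If the data was written with DataOutput.writeUTF, you first need to read the unsigned short that holds the encoded length in bytes before calling the method for the first character; a plain UTF-8 text file has no such prefix.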

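Note that the encoding used here is modified UTF-8 as described in the Javadoc of DataInput: there is no 4-byte form, so a character outside the Basic Multilingual Plane appears as two 3-byte encoded surrogates, and U+0000 is written as two bytes.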
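
To make the difference concrete, the following sketch compares what DataOutputStream.writeUTF produces with standard UTF-8 from String.getBytes. ModifiedUtf8Demo, writeUtf and hex are names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; ModifiedUtf8Demo, writeUtf and hex are hypothetical names.
class ModifiedUtf8Demo {
    // Returns the bytes writeUTF emits: a 2-byte length prefix plus modified UTF-8.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(hex(writeUtf(nul)));                         // 00 02 C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));  // 00
        System.out.println(hex(writeUtf(emoji)));                       // 00 06 ED A0 BD ED B8 80
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
    }
}
```

The leading 00 02 and 00 06 are the length prefixes in bytes; everything after them is the modified UTF-8 payload.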
