I'm not entirely sure what you're trying to do, but let me give you some information that might help.
The UTF-8 encoding represents each character as 1, 2, 3, or 4 bytes, depending on the character's Unicode code point.
- For characters 0x00-0x7F, UTF-8 encodes the character as a single byte. This is a very useful property because if you're only dealing with 7-bit ASCII characters, the UTF-8 and ASCII encodings are identical.
- For characters 0x80-0x7FF, UTF-8 uses 2 bytes: the first byte is binary 110 followed by the 5 high bits of the character, while the second byte is binary 10 followed by the 6 low bits of the character.
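- The 3- and 4-byte encodings follow the same pattern as the 2-byte encoding, just with more trailing bytes: the first byte of the 3-byte encoding (characters 0x800-0xFFFF) starts with 1110, and the first byte of the 4-byte encoding (characters 0x10000-0x10FFFF) starts with 11110. Every subsequent byte starts with 10 and carries 6 bits of the character.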
- See Wikipedia for all the details.
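To make the bit layouts above concrete, here is a rough sketch of an encoder that builds the bytes by hand. The class and method names (Utf8Encoder, encode) are made up for illustration; in real code you would normally just call String.getBytes with StandardCharsets.UTF_8.

```java
final class Utf8Encoder {
    /** Encodes one code point (0..0x10FFFF, surrogates excluded) as UTF-8. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw new IllegalArgumentException("not encodable: " + cp);
        if (cp <= 0x7F)                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp <= 0x7FF)                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        if (cp <= 0xFFFF)                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        return new byte[] {                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }
}
```

For example, Utf8Encoder.encode(0xE9) (the character é) yields the two bytes 0xC3 0xA9.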
Now this may seem pretty byzantine but the upshot of it is this: you can read any byte in a UTF-8 file and know whether you're looking at a standalone character, the first byte of a multibyte character, or one of the other bytes of a multibyte character.
If the byte you read starts with binary 0, you're looking at a single-byte character. If it starts with 110, 1110, or 11110, then you have the first byte of a multibyte character of 2, 3, or 4 bytes, respectively. If it starts with 10, then it's one of the subsequent bytes of a multibyte character; scan backwards to find the start of it.
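The classification rules above boil down to a handful of mask tests. A minimal sketch follows; Utf8ByteKind and sequenceLength are hypothetical names, and the check only looks at the leading bit pattern, so it does not reject bytes such as 0xC0 or 0xF5 that a strict decoder would refuse.

```java
final class Utf8ByteKind {
    /**
     * Returns the total length (1-4) of the sequence this byte starts,
     * 0 if it is a continuation byte, or -1 if the pattern is not UTF-8.
     */
    static int sequenceLength(int b) {
        b &= 0xFF;                              // tolerate sign-extended bytes
        if ((b & 0x80) == 0x00) return 1;       // 0xxxxxxx: standalone
        if ((b & 0xC0) == 0x80) return 0;       // 10xxxxxx: continuation
        if ((b & 0xE0) == 0xC0) return 2;       // 110xxxxx: 2-byte lead
        if ((b & 0xF0) == 0xE0) return 3;       // 1110xxxx: 3-byte lead
        if ((b & 0xF8) == 0xF0) return 4;       // 11110xxx: 4-byte lead
        return -1;                              // 11111xxx: never valid
    }
}
```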
So if you want to let your caller seek to any random position in a file and read the UTF-8 character there, you can just apply the algorithm above to find the first byte of that character (if it's not the one at the specified position) and then read and decode the value.
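Something along these lines might do it for all four lengths. This is only a sketch: Utf8Seek and codePointAt are invented names, it assumes the file really is well-formed UTF-8, and it returns an int code point because 4-byte characters do not fit in a Java char. A sturdier version would also refuse to back up more than 3 bytes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;

final class Utf8Seek {
    /** Returns the code point whose encoding covers byte offset pos. */
    static int codePointAt(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        int b = file.read();
        // Step back over continuation bytes (10xxxxxx) to reach the lead byte.
        while (b >= 0 && (b & 0xC0) == 0x80 && pos > 0) {
            file.seek(--pos);
            b = file.read();
        }
        if (b < 0) throw new EOFException("offset is past end of file");
        int extra; // continuation bytes that follow the lead byte
        int cp;    // payload bits collected so far
        if ((b & 0x80) == 0x00)      { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else throw new IOException("invalid UTF-8 lead byte at offset " + pos);
        for (int i = 0; i < extra; i++) {
            int c = file.read();
            if (c < 0 || (c & 0xC0) != 0x80)
                throw new IOException("truncated or malformed UTF-8 sequence");
            cp = (cp << 6) | (c & 0x3F); // append 6 more payload bits
        }
        return cp;
    }
}
```

The file pointer is left just past the decoded character, so a caller can keep reading from there.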
See the Java Charset class for a method to decode UTF-8 from the source bytes. There may be easier ways but Charset will work.
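For instance, once you know where a character starts and how many bytes it occupies, the decode step could look like the sketch below. The wrapper class CharsetDecodeDemo and its decode method are just illustrative; Charset.decode and ByteBuffer.wrap are the actual JDK calls. Note that Charset.decode silently replaces malformed input with U+FFFD rather than throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CharsetDecodeDemo {
    /** Decodes len bytes of buf, starting at off, as UTF-8. */
    static String decode(byte[] buf, int off, int len) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(buf, off, len)).toString();
    }
}
```

new String(buf, off, len, StandardCharsets.UTF_8) is a shorter way to get the same result.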
Update: This code should handle the 1- and 2-byte UTF-8 cases, where source is your RandomAccessFile. Not tested at all, YMMV.

    for (;;) {
        int b = source.read();
        // read() returns -1 at end of file; don't loop forever on it.
        if (b < 0)
            throw new EOFException();
        // Single-byte character starting with binary 0.
        if ((b & 0x80) == 0)
            return (char) b;
        // 2-byte character starting with binary 110.
        if ((b & 0xE0) == 0xC0)
            return (char) ((b & 0x1F) << 6 | source.read() & 0x3F);
        // 2nd, 3rd, or 4th byte of a multibyte char starting with 10.
        // Back up to the byte before this one and loop.
        if ((b & 0xC0) == 0x80) {
            source.seek(source.getFilePointer() - 2);
            continue;
        }
        // 3 and 4 byte encodings left as an exercise...
        throw new IOException("3- and 4-byte sequences not handled");
    }

I wouldn't bother with seekPointer. The RandomAccessFile knows what it is; just call getFilePointer when you need it.