SeekableByteChannel russian chars

Question

Recently I've started learning about java.nio, and my textbook has an example of how to read a text file with SeekableByteChannel:

int count;
Path path;

try {
    path = Paths.get(System.getProperty("user.home") + "/desktop/text.txt");
} catch (InvalidPathException e) {
    out.println(e.getMessage());
    return;
}

try (SeekableByteChannel channel = Files.newByteChannel(path)) {
    ByteBuffer buffer = ByteBuffer.allocate(128);

    do {
        count = channel.read(buffer);

        if (count != -1) {
            buffer.rewind();
            for (int i = 0; i < count; i++)
                out.print((char) buffer.get());
        }
    } while (count != -1);

} catch (IOException e) {
    out.println("File not found!!!");
}

out.flush();

So I've made a text file with English and Russian words in it using an ANSI encoding. And this is what I get:

The buffer.get() method returns a byte value, and Russian characters start somewhere around 1000. So I changed the encoding to UTF-8 and used another method:

for (int i = 0; i < count; i += 2)
    out.print(buffer.getChar()); //reads 2 bytes and converts them to char

But this gives me a line of question marks.

So does anyone know how to properly read Russian text using SeekableByteChannel?

text.txt could be nailed better for the example. – bohdan_trotsenko Nov 30 '17 at 09:43

score 1 · Accepted Answer · answered Sep 03 '15 at 17:48

The method getChar() of ByteBuffer reads two bytes and interprets them as the high byte and low byte of a char (in the buffer's byte order, which is big-endian by default). In other words, it always treats the data as UTF-16, so it cannot decode UTF-8 or a single-byte "ANSI" code page. Generally, you shouldn’t try to assemble Strings from bytes manually, neither with the old I/O API nor with NIO. To mention just one thing you would have to deal with when decoding bytes from a buffer manually: with multi-byte encodings, the bytes in your buffer may not end at a character boundary.
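To illustrate the point about getChar(), here is a minimal sketch that feeds the same Russian word to getChar() once as UTF-16BE bytes and once as UTF-8 bytes. The class name GetCharDemo and the helper readWithGetChar are made up for this illustration, and the sample word is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class GetCharDemo {
    // Reads the bytes two at a time with getChar(), like the second attempt in the question.
    static String readWithGetChar(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes); // a new buffer is big-endian by default
        StringBuilder sb = new StringBuilder();
        while (buffer.remaining() >= 2) {
            sb.append(buffer.getChar());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442"; // a six-letter Russian word

        // UTF-16BE bytes match what getChar() expects, so the text round-trips.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(readWithGetChar(utf16).equals(text)); // true

        // UTF-8 bytes are paired up into unrelated chars, so the result is garbage.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(readWithGetChar(utf8).equals(text)); // false
    }
}
```

If the console cannot display the resulting garbage characters, it will typically show them as question marks.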

If you want to read text from a SeekableByteChannel, you may use Channels.newReader(…) to construct a Reader using the specified charset to decode the bytes.
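A minimal sketch of the Channels.newReader approach might look like the following. The class name ChannelReaderDemo and the helper readAll are hypothetical, and the demo writes its own temporary UTF-8 file rather than using the file from the question. For a file saved as "ANSI" on a Russian-locale Windows system, the charset would most likely be windows-1251, but that is an assumption about the questioner's setup.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChannelReaderDemo {
    // Decodes the whole file through a Reader that wraps the channel.
    static String readAll(Path path, Charset charset) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path);
             Reader reader = Channels.newReader(channel, charset.name())) {
            StringBuilder sb = new StringBuilder();
            char[] chunk = new char[128];
            int n;
            // The Reader handles characters split across internal buffer boundaries.
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a temporary UTF-8 file with English and Russian text.
        Path tmp = Files.createTempFile("text", ".txt");
        try {
            String text = "Hello, \u043c\u0438\u0440!";
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            System.out.println(readAll(tmp, StandardCharsets.UTF_8).equals(text));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The charset passed to the reader has to match the encoding the file was actually saved in; the reader cannot detect it.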

But of course, you can skip the Channel stuff entirely and use Files.newBufferedReader(…) to create a Reader right from the Path.
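For comparison, a sketch of the Files.newBufferedReader variant could look like this. The class name BufferedReaderDemo and the helper readLines are invented for the example, and the demo again uses a temporary UTF-8 file instead of the questioner's text.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BufferedReaderDemo {
    // Reads the file line by line, decoding with the given charset.
    static List<String> readLines(Path path, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("text", ".txt");
        try {
            String content = "English words\n\u0420\u0443\u0441\u0441\u043a\u0438\u0435 \u0441\u043b\u043e\u0432\u0430\n";
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            for (String line : readLines(tmp, StandardCharsets.UTF_8)) {
                System.out.println(line);
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```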

By the way, the example code is questionable, even for reading a sequence of bytes. Here is a simplified example:

Path path = Paths.get(System.getProperty("user.home")).resolve("desktop/text.txt");
try (FileChannel channel = FileChannel.open(path)) {
    ByteBuffer buffer = ByteBuffer.allocate(128);
    while (channel.read(buffer) != -1) {
        buffer.flip();   // switch from filling to draining: limit = bytes read, position = 0
        while (buffer.hasRemaining())
            System.out.printf("%02x ", buffer.get());
        buffer.clear();  // everything was consumed, so reset for the next read
        System.out.println();
    }
} catch (IOException e) {
    System.out.println(e.toString());
}

A ByteBuffer knows how many bytes it contains (i.e. how many have been put into it by the read operation). With flip you prepare the buffer for reading them out, e.g. with a loop like in the example or by writing into another channel. When you know that you have processed the entire contents, you can use clear to reset the buffer to its initial state, where it can be filled from the beginning up to the end.

Otherwise, if it may contain unprocessed data, use compact instead. This moves the unprocessed data to the beginning of the buffer and prepares it for receiving more data after it, so after a subsequent read and flip you have the pending data of the previous iteration followed by the data of the most recent read operation, ready to be processed as a single sequence. (This is how the Reader deals with incomplete character sequences internally while decoding.)
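The read, flip, process, compact cycle can be sketched with a CharsetDecoder doing the processing. This is only an illustration of the buffer handling, not the actual implementation inside the JDK's Reader classes. The class name CompactDecodeDemo and the method decode are made up, and the sketch assumes a blocking channel and a bufferSize large enough to hold at least one complete encoded character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class CompactDecodeDemo {
    static String decode(ReadableByteChannel channel, Charset charset, int bufferSize)
            throws IOException {
        CharsetDecoder decoder = charset.newDecoder();
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CharBuffer chars = CharBuffer.allocate(bufferSize);
        StringBuilder sb = new StringBuilder();
        boolean eof = false;
        while (!eof) {
            eof = channel.read(bytes) == -1;
            bytes.flip(); // prepare the bytes for being read by the decoder
            CoderResult result;
            do {
                result = decoder.decode(bytes, chars, eof);
                if (result.isError()) {
                    result.throwException(); // malformed or unmappable input
                }
                chars.flip();
                sb.append(chars);
                chars.clear(); // all decoded chars were consumed
            } while (result.isOverflow());
            // Keep the bytes of an incomplete character for the next round.
            bytes.compact();
        }
        CoderResult flushed;
        do { // let the decoder emit any internally buffered chars
            flushed = decoder.flush(chars);
            chars.flip();
            sb.append(chars);
            chars.clear();
        } while (flushed.isOverflow());
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "Hi \u041f\u0440\u0438\u0432\u0435\u0442!";
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(data));
        // A 5-byte buffer forces two-byte characters to be split across reads.
        System.out.println(decode(channel, StandardCharsets.UTF_8, 5).equals(text));
    }
}
```

Using clear instead of compact on the byte buffer here would drop the leading byte of any character that straddles two reads.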
