
I know that Java's ByteBuffer.clear() does not actually erase the data in the ByteBuffer, so every time I StringBuilder.append() the string, the result also contains all the leftover chars in the ByteBuffer, i.e. old data from the previous read. How can I fix this issue?

int byteRead = -1;
int readCount = 0;
int BUFFER_SIZE = 256;
StringBuilder sb = new StringBuilder();
ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
ReadableByteChannel readableByteChannel = Channels.newChannel(is);
while ((byteRead = readableByteChannel.read(buffer)) > 0 && readCount < 68) {
    sb.append(new String(buffer.array(), "UTF-8"));
    buffer.clear();
    readCount++;
}
Andy Chan
  • Eh? Clearing all the data is exactly what it's for. Your problem is that you're ignoring the read length by just building the string directly out of the entire array. – user207421 Jun 14 '16 at 00:07

3 Answers


As already pointed out by other answers, you have to consider the position of the buffer, which gets updated by the read method. So the correct code looks like:

while ((byteRead = readableByteChannel.read(buffer)) > 0 && readCount < 68) {
    // decode only the bytes of the last read: position() bytes starting at arrayOffset()
    sb.append(new String(buffer.array(),
        buffer.arrayOffset(), buffer.position(), "UTF-8"));
    buffer.clear();
    readCount++;
}

Note that in your special case, arrayOffset() will always be zero, but it is better to write the code in a way that does not break when you change something about the buffer allocation (the sketch below illustrates such a case).
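
As an illustration only (a minimal sketch, not code you need here; the 1024/256 sizes are arbitrary), a buffer obtained via slice() of a larger buffer is one case where arrayOffset() is non-zero, and the offset then has to go into the String constructor:

ByteBuffer big = ByteBuffer.allocate(1024);   // one large backing array
big.position(256);
ByteBuffer buffer = big.slice();              // shares the array, arrayOffset() == 256

// after readableByteChannel.read(buffer), the bytes just read live at
// array indices arrayOffset() .. arrayOffset() + position() - 1
String chunk = new String(buffer.array(),
    buffer.arrayOffset(), buffer.position(), StandardCharsets.UTF_8);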

But this code is broken. When you read a multi-byte UTF-8 sequence, it may happen that the first bytes of that sequence are read in one operation and the remaining bytes in the next one. Your attempts to create String instances from these incomplete sequences will produce invalid characters, as the sketch below demonstrates. Besides that, you are creating these String instances just to copy their contents into a StringBuilder, which is quite inefficient.
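
To see the effect in isolation, here is a small sketch (the example character is arbitrary) of what happens when the two bytes of one UTF-8 character end up in different reads:

byte[] encoded = "é".getBytes(StandardCharsets.UTF_8);   // two bytes: 0xC3 0xA9

// pretend the first read delivered only the first byte
// and the next read delivered the second one
String part1 = new String(encoded, 0, 1, StandardCharsets.UTF_8);
String part2 = new String(encoded, 1, 1, StandardCharsets.UTF_8);

System.out.println(part1 + part2);   // two '\uFFFD' replacement characters, not "é"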

So, to do it correctly, you should do something like:

int readCount = 0;
int BUFFER_SIZE = 256;
StringBuilder sb = new StringBuilder();
CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
ByteBuffer buffer = ByteBuffer.allocate(BUFFER_SIZE);
CharBuffer cBuffer = CharBuffer.allocate(BUFFER_SIZE);
ReadableByteChannel readableByteChannel = Channels.newChannel(is);
while(readableByteChannel.read(buffer) > 0 && readCount < 68) {
    buffer.flip();                    // switch the buffer from filling to draining
    // decode as much as fits into cBuffer; repeat while cBuffer overflows
    while(dec.decode(buffer, cBuffer, false).isOverflow()) {
        cBuffer.flip();
        sb.append(cBuffer);
        cBuffer.clear();
    }
    buffer.compact();                 // keep an incomplete trailing sequence for the next read
    readCount++;
}
buffer.flip();
// final decode pass; endOfInput = true tells the decoder no more bytes will follow
for(boolean more = true; more; ) {
    more = dec.decode(buffer, cBuffer, true).isOverflow();
    cBuffer.flip();
    sb.append(cBuffer);
    cBuffer.clear();
}

Note how both ReadableByteChannel and CharsetDecoder process the buffers using their positions and limits. All you have to do is use flip and compact correctly, as shown in the documentation of compact.

The only exception is the appending to the StringBuilder, as that is not an NIO function. There, we have to use clear(), as we know that the StringBuilder.append operation consumes all characters from the buffer.

Note that this code still does not deal with one (unavoidable) error condition: since you stop after an arbitrary number of reads, it is always possible that you cut off in the middle of a multi-byte UTF-8 sequence.


But this rather complicated logic has already been implemented by the JRE, and if you give up the idea of cutting off after a certain number of bytes, you can utilize it:

int readCount = 0;
int BUFFER_SIZE = 256;
StringBuilder sb = new StringBuilder();
CharBuffer cBuffer = CharBuffer.allocate(BUFFER_SIZE);
ReadableByteChannel readableByteChannel = Channels.newChannel(is);
Reader reader = Channels.newReader(readableByteChannel, "UTF-8");
while(reader.read(cBuffer) > 0 && readCount < 68) {
    cBuffer.flip();
    sb.append(cBuffer);
    cBuffer.clear();
    readCount++;
}

Now this code will limit the reading to 256 × 68 characters rather than bytes, but for UTF-8 encoded data, this makes a difference only when there are multi-byte sequences, about which you apparently didn’t care before.

Finally, since you apparently have an InputStream in the first place, you don’t need the ReadableByteChannel detour at all:

int readCount = 0;
int BUFFER_SIZE = 256;
StringBuilder sb = new StringBuilder();
CharBuffer cBuffer = CharBuffer.allocate(BUFFER_SIZE);
Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8);
while(reader.read(cBuffer) > 0 && readCount < 68) {
    cBuffer.flip();
    sb.append(cBuffer);
    cBuffer.clear();
    readCount++;
}

This might look like “not being NIO code”, but Readers are still the canonical way of reading character data, even with NIO; there is no replacement. The method Reader.read(CharBuffer) was missing in the first release of NIO, but was added in Java 5.

Holger
  • Why not use BufferedReader? Which performs better: ReadableByteChannel vs Reader vs BufferedReader? – Andy Chan Jun 13 '16 at 12:34
  • Why use `BufferedReader`? A `BufferedReader` only makes sense if you are repeatedly using its method for reading a single char, or are using a buffer smaller than the `BufferedReader`’s own buffer. In other words, it makes no sense here. If your buffer is too small, just use a bigger buffer. And if you are concerned about performance, you don’t use the single-char read method. The only remaining useful feature of `BufferedReader` is to allow processing of lines, which is irrelevant here. – Holger Jun 13 '16 at 12:38
  • It seems to be not so widely known, but the `Buffered…` variants of `InputStream` and `Reader` do nothing if you pass in a sufficiently large buffer. If the buffer is only a bit smaller than their own buffer, performance may even get worse due to unnecessary copying. Consequently, the `Channel` API does not provide buffered channel variants, as with channels, you always have to provide your own buffer. – Holger Jun 13 '16 at 12:45
  • Manually dealing with `ReadableByteChannel` here only raises the chance of errors, while a plain `Reader` will do best, regardless of whether you initialize it via `InputStream` or `ReadableByteChannel`. You may consider using a `CharBuffer` larger than `256` here; just try different sizes and measure. Or you use a `CharBuffer` large enough to hold the entire data and skip the copying to the `StringBuilder`, if possible. `CharBuffer` implements the `CharSequence` interface, so, depending on what you actually want to do, it might be a drop-in replacement for the `StringBuilder` (see the sketch below). – Holger Jun 13 '16 at 12:50
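
A minimal sketch of that last suggestion, assuming you can name an upper bound for the amount of text (the MAX_CHARS value here is made up):

int MAX_CHARS = 1 << 16;                          // assumed upper bound on the text size
CharBuffer cBuffer = CharBuffer.allocate(MAX_CHARS);
Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8);

// read(CharBuffer) writes at the buffer's position, so no flip/clear per iteration
while(reader.read(cBuffer) > 0) { }

cBuffer.flip();                                   // [0, limit) now holds everything read
CharSequence result = cBuffer;                    // usable wherever a CharSequence is expected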

Use position() to get the current buffer position and extract that part of the array with Arrays.copyOfRange:

Arrays.copyOfRange(buffer.array(), 0, buffer.position());

Which will become in your case:

sb.append(new String(Arrays.copyOfRange(buffer.array(), 0, buffer.position()), "UTF-8"));

Or even shorter when using appropriate String constructor:

sb.append(new String(buffer.array(), 0, buffer.position(), "UTF-8"));

Note that slice() does not help here: after a read, slice() covers the unwritten remainder of the buffer, and calling array() on the slice still returns the whole backing array, so sb.append(new String(buffer.slice().array(), "UTF-8")) would reproduce the original problem.
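
If you want to avoid the explicit offset/length handling altogether, a possible alternative (a sketch only; it shares the split-multi-byte-sequence caveat from the accepted answer) is to flip the buffer and let the Charset decode just the bytes that were read:

buffer.flip();                                     // limit = bytes read, position = 0
sb.append(StandardCharsets.UTF_8.decode(buffer));  // decodes only position..limit
buffer.clear();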

BTW, instead of "UTF-8" it is better to use StandardCharsets.UTF_8 (the overloads taking a Charset do not throw the checked UnsupportedEncodingException).

Krzysztof Krasoń

You can use the new String(byte[] bytes, int offset, int length, String charsetName) constructor.

new String(buffer.array(), 0, byteRead, "UTF-8"); 

This will prevent the previous data from being used when the new String is created.

Titus