I'm writing a decoder that will receive a sequence of byte buffers and decode the contents into a single `String`. There can be any number of byte buffers, each containing any number of bytes. The buffers aren't necessarily split on character boundaries, so depending on the encoding they might contain partial characters at the start or end. Here's what I want to be able to do, where `StringByteStreamDecoder` is the new class I need to write:
```kotlin
suspend fun decode(data: Flow<ByteBuffer>, charset: Charset): String {
    val decoder = StringByteStreamDecoder(charset)
    data.collect { bytes ->
        decoder.feed(bytes)
    }
    decoder.endOfInput()
    return decoder.toString()
}
```
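To make the partial-character problem concrete, here's a quick standalone demonstration (using UTF-8, where `é` encodes as two bytes) of why each buffer can't simply be decoded on its own:

```kotlin
import java.nio.charset.StandardCharsets

fun main() {
    // "café" is 5 bytes in UTF-8; the final "é" is the two bytes 0xC3 0xA9.
    // Split the encoded bytes mid-character across two chunks.
    val bytes = "café".toByteArray(StandardCharsets.UTF_8)
    val first = bytes.copyOfRange(0, 4)            // "caf" + first byte of "é"
    val second = bytes.copyOfRange(4, bytes.size)  // second byte of "é"

    // Decoding each chunk independently mangles the split character...
    val naive = String(first, StandardCharsets.UTF_8) + String(second, StandardCharsets.UTF_8)
    println(naive)  // "caf��" – two replacement characters

    // ...while decoding the concatenated bytes works fine.
    println(String(first + second, StandardCharsets.UTF_8))  // "café"
}
```

This is why the decoder class has to carry state between calls to `feed` rather than treating each buffer as a self-contained message.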
## Attempt 1
The simplest approach is to collect all the byte buffers into a single byte array. I rejected this approach because it has significant memory overhead. It requires allocating space for the full message at least twice: once for the raw bytes and once for the decoded characters. Here's my simple implementation, using a `ByteArrayOutputStream` as an expanding byte buffer.
```kotlin
class StringByteStreamDecoder(private val charset: Charset) {

    private val buffer = ByteArrayOutputStream()

    fun feed(data: ByteBuffer) {
        if (data.hasArray()) {
            // Heap buffer: copy straight out of the backing array.
            buffer.write(data.array(), data.arrayOffset() + data.position(), data.remaining())
            data.position(data.limit())  // mark the input as consumed, like the branch below
        } else {
            // Direct buffer: no accessible array, so copy through a temporary one.
            val array = ByteArray(data.remaining())
            data.get(array)
            buffer.write(array, 0, array.size)
        }
    }

    fun endOfInput() {
        buffer.flush()  // a no-op for ByteArrayOutputStream, but harmless
    }

    override fun toString(): String {
        return buffer.toString(charset)
    }
}
```
## Attempt 2
To avoid buffering the entire byte stream in memory, I'd like to decode characters on the fly. It's not possible to decode each byte buffer directly to character data, because it might contain partial characters at the start and end. The character decoder (as far as I understand) doesn't have the ability to buffer partial characters, and will only consume complete characters. So, for each incoming byte buffer, my approach is:
- Read some data from the incoming byte buffer into a small temporary byte buffer
- Decode as many characters as possible from the temporary byte buffer
- Repeat until the incoming byte buffer has no remaining data
After all the data has been received, any remaining bytes in the temporary byte buffer can be flushed. This solves the problem with partial characters, provided the temporary byte buffer is at least as big as the charset's widest character.
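As a sanity check of the mechanism the steps above rely on: the `CharsetDecoder` consumes only complete characters and leaves any trailing partial sequence in the input buffer, which is exactly what `compact()` carries over to the next round. A minimal standalone illustration:

```kotlin
import java.nio.ByteBuffer
import java.nio.CharBuffer
import java.nio.charset.StandardCharsets

fun main() {
    val decoder = StandardCharsets.UTF_8.newDecoder()
    val out = CharBuffer.allocate(16)

    // First chunk ends mid-character: "caf" plus the first byte of "é" (0xC3).
    val chunk1 = ByteBuffer.wrap(byteArrayOf(0x63, 0x61, 0x66, 0xC3.toByte()))
    decoder.decode(chunk1, out, false)  // underflow: consumes "caf" only
    println(chunk1.remaining())         // 1 – the partial character stays in the input

    // Carry the leftover byte over in front of the next chunk's bytes,
    // which is what compact() achieves in the class below.
    val chunk2 = ByteBuffer.allocate(8)
    chunk2.put(chunk1).put(0xA9.toByte())
    chunk2.flip()
    decoder.decode(chunk2, out, true)
    decoder.flush(out)                  // a no-op for UTF-8, but required in general
    out.flip()
    println(out.toString())             // prints "café"
}
```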
```kotlin
class StringByteStreamDecoder(charset: Charset, bufferSize: Int = 1024) {

    private val decoder = charset.newDecoder()
    private val tmpBytes = ByteBuffer.allocate(bufferSize)
    private val tmpChars =
        CharBuffer.allocate((tmpBytes.capacity() * decoder.maxCharsPerByte()).toInt() + 1)
    private val stringBuilder = StringBuilder()

    fun feed(data: ByteBuffer) {
        do {
            tmpBytes.put(data.nextSlice(maxSize = tmpBytes.remaining()))
            flushBytes()
        } while (data.hasRemaining())
    }

    fun endOfInput() {
        flushBytes(endOfInput = true)
        // Let stateful charsets emit any final characters (a no-op for UTF-8).
        decoder.flush(tmpChars)
        flushChars()
    }

    override fun toString(): String = stringBuilder.toString()

    private fun ByteBuffer.nextSlice(maxSize: Int): ByteBuffer {
        val size = minOf(maxSize, remaining())
        val slice = slice(position(), size)
        position(position() + size)
        return slice
    }

    private fun flushBytes(endOfInput: Boolean = false) {
        tmpBytes.flip()
        // Decodes as many complete characters as possible; any trailing partial
        // character is left in tmpBytes and carried over by compact().
        decoder.decode(tmpBytes, tmpChars, endOfInput)
        tmpBytes.compact()
        flushChars()
    }

    private fun flushChars() {
        tmpChars.flip()
        stringBuilder.append(tmpChars)
        tmpChars.clear()
    }
}
```
I'm still not completely happy with this approach, because of the extra temporary buffers. I would like to be able to make the temporary byte buffer hold a maximum of one (partial) character. However, if I did that, I'd have to somehow prepend it to the next incoming chunk of data. That would mean allocating a new byte buffer to contain the buffered partial character plus the new incoming data. Copying all the data from the incoming buffer to the concatenated buffer is no more efficient than just using a bigger temporary buffer in the first place.
However, for small strings, the temporary byte buffer represents a significant overhead. I could make the temporary buffer smaller, but that might hurt performance when decoding larger strings.
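For a sense of scale, here's a rough back-of-envelope calculation of that fixed overhead for the default `bufferSize = 1024`, assuming UTF-8 (whose decoder reports `maxCharsPerByte() == 1.0`) and the JVM's 2-byte `char`:

```kotlin
import java.nio.charset.StandardCharsets

fun main() {
    val maxCharsPerByte = StandardCharsets.UTF_8.newDecoder().maxCharsPerByte()  // 1.0f
    val tmpBytesSize = 1024                                       // bytes in tmpBytes
    val tmpCharsSize = (1024 * maxCharsPerByte).toInt() + 1       // 1025 chars in tmpChars
    val overheadBytes = tmpBytesSize + tmpCharsSize * 2           // each char is 2 bytes
    println(overheadBytes)  // 3074 – roughly 3 KiB, even for a one-character input
}
```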
I also know that the `StringBuilder` will dynamically resize depending on the input, and might not be the most efficient way to allocate space for the resulting `String`.
I think I could avoid some of the extra memory allocation if I had access to something like the chain buffer described in this answer. That would allow me to create a concatenated, windowed view of the incoming byte buffers. The character decoder could then consume the concatenated view directly, instead of needing the extra temporary buffer. However, I can't find anything in the standard library that offers that kind of functionality.
Is it possible to solve this problem without allocating extra memory beyond the incoming data and the resulting `String` itself? If not, what's the minimum amount of extra memory that's needed, and what's the approach that will achieve that minimum?