I'm writing a decoder that will receive a sequence of byte buffers and decode the contents into a single `String`. There can be any number of byte buffers, each containing any number of bytes. The buffers aren't necessarily split on character boundaries, so depending on the encoding they might contain partial characters at the start or end. Here's what I want to be able to do, where `StringByteStreamDecoder` is the new class I need to write:
```kotlin
suspend fun decode(data: Flow<ByteBuffer>, charset: Charset): String {
    val decoder = StringByteStreamDecoder(charset)
    data.collect { bytes ->
        decoder.feed(bytes)
    }
    decoder.endOfInput()
    return decoder.toString()
}
```
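To make the partial-character problem concrete, here's a quick standalone demonstration (using UTF-8, where `é` encodes as two bytes) of why each buffer can't simply be decoded on its own:

```kotlin
import java.nio.charset.StandardCharsets

fun main() {
    // "café" is 5 bytes in UTF-8; the final "é" is the two bytes 0xC3 0xA9.
    // Split the encoded bytes mid-character across two chunks.
    val bytes = "café".toByteArray(StandardCharsets.UTF_8)
    val first = bytes.copyOfRange(0, 4)            // "caf" + first byte of "é"
    val second = bytes.copyOfRange(4, bytes.size)  // second byte of "é"

    // Decoding each chunk independently mangles the split character...
    val naive = String(first, StandardCharsets.UTF_8) + String(second, StandardCharsets.UTF_8)
    println(naive)  // "caf��" – two replacement characters

    // ...while decoding the concatenated bytes works fine.
    println(String(first + second, StandardCharsets.UTF_8))  // "café"
}
```

This is why the decoder class has to carry state between calls to `feed` rather than treating each buffer as a self-contained message.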
## Attempt 1
The simplest approach is to collect all the byte buffers into a single byte array. I rejected this approach because it has significant memory overhead. It requires allocating space for the full message at least twice: once for the raw bytes and once for the decoded characters. Here's my simple implementation, using a `ByteArrayOutputStream` as an expanding byte buffer.
```kotlin
class StringByteStreamDecoder(private val charset: Charset) {

    private val buffer = ByteArrayOutputStream()

    fun feed(data: ByteBuffer) {
        if (data.hasArray()) {
            // Heap buffer: copy straight out of the backing array.
            buffer.write(data.array(), data.arrayOffset() + data.position(), data.remaining())
            data.position(data.limit())  // mark the input as consumed, like the branch below
        } else {
            // Direct buffer: no accessible array, so copy through a temporary one.
            val array = ByteArray(data.remaining())
            data.get(array)
            buffer.write(array, 0, array.size)
        }
    }

    fun endOfInput() {
        buffer.flush()  // a no-op for ByteArrayOutputStream, but harmless
    }

    override fun toString(): String {
        return buffer.toString(charset)
    }
}
```
## Attempt 2
To avoid buffering the entire byte stream in memory, I'd like to decode characters on the fly. It's not possible to decode each byte buffer directly to character data, because it might contain partial characters at the start and end. The character decoder (as far as I understand) doesn't have the ability to buffer partial characters, and will only consume complete characters. So, for each incoming byte buffer, my approach is:
- Read some data from the incoming byte buffer into a small temporary byte buffer
- Decode as many characters as possible from the temporary byte buffer
- Repeat until the incoming byte buffer has no remaining data
After all the data has been received, any remaining bytes in the temporary byte buffer can be flushed. This solves the problem with partial characters, provided the temporary byte buffer is at least as big as the charset's widest character.
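As a sanity check of the mechanism the steps above rely on: the `CharsetDecoder` consumes only complete characters and leaves any trailing partial sequence in the input buffer, which is exactly what `compact()` carries over to the next round. A minimal standalone illustration:

```kotlin
import java.nio.ByteBuffer
import java.nio.CharBuffer
import java.nio.charset.StandardCharsets

fun main() {
    val decoder = StandardCharsets.UTF_8.newDecoder()
    val out = CharBuffer.allocate(16)

    // First chunk ends mid-character: "caf" plus the first byte of "é" (0xC3).
    val chunk1 = ByteBuffer.wrap(byteArrayOf(0x63, 0x61, 0x66, 0xC3.toByte()))
    decoder.decode(chunk1, out, false)  // underflow: consumes "caf" only
    println(chunk1.remaining())         // 1 – the partial character stays in the input

    // Carry the leftover byte over in front of the next chunk's bytes,
    // which is what compact() achieves in the class below.
    val chunk2 = ByteBuffer.allocate(8)
    chunk2.put(chunk1).put(0xA9.toByte())
    chunk2.flip()
    decoder.decode(chunk2, out, true)
    decoder.flush(out)                  // a no-op for UTF-8, but required in general
    out.flip()
    println(out.toString())             // prints "café"
}
```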
```kotlin
class StringByteStreamDecoder(charset: Charset, bufferSize: Int = 1024) {

    private val decoder = charset.newDecoder()
    private val tmpBytes = ByteBuffer.allocate(bufferSize)
    private val tmpChars =
        CharBuffer.allocate((tmpBytes.capacity() * decoder.maxCharsPerByte()).toInt() + 1)
    private val stringBuilder = StringBuilder()

    fun feed(data: ByteBuffer) {
        do {
            tmpBytes.put(data.nextSlice(maxSize = tmpBytes.remaining()))
            flushBytes()
        } while (data.hasRemaining())
    }

    fun endOfInput() {
        flushBytes(endOfInput = true)
        // Let stateful charsets emit any final characters (a no-op for UTF-8).
        decoder.flush(tmpChars)
        flushChars()
    }

    override fun toString(): String = stringBuilder.toString()

    private fun ByteBuffer.nextSlice(maxSize: Int): ByteBuffer {
        val size = minOf(maxSize, remaining())
        val slice = slice(position(), size)
        position(position() + size)
        return slice
    }

    private fun flushBytes(endOfInput: Boolean = false) {
        tmpBytes.flip()
        // Decodes as many complete characters as possible; any trailing partial
        // character is left in tmpBytes and carried over by compact().
        decoder.decode(tmpBytes, tmpChars, endOfInput)
        tmpBytes.compact()
        flushChars()
    }

    private fun flushChars() {
        tmpChars.flip()
        stringBuilder.append(tmpChars)
        tmpChars.clear()
    }
}
```
I'm still not completely happy with this approach, because of the extra temporary buffers. I would like to be able to make the temporary byte buffer hold a maximum of one (partial) character. However, if I did that, I'd have to somehow prepend it to the next incoming chunk of data. That would mean allocating a new byte buffer to contain the buffered partial character plus the new incoming data. Copying all the data from the incoming buffer to the concatenated buffer is no more efficient than just using a bigger temporary buffer in the first place.
However, for small strings, the temporary byte buffer represents a significant overhead. I could make the temporary buffer smaller, but that might hurt performance when decoding larger strings.
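For a sense of scale, here's a rough back-of-envelope calculation of that fixed overhead for the default `bufferSize = 1024`, assuming UTF-8 (whose decoder reports `maxCharsPerByte() == 1.0`) and the JVM's 2-byte `char`:

```kotlin
import java.nio.charset.StandardCharsets

fun main() {
    val maxCharsPerByte = StandardCharsets.UTF_8.newDecoder().maxCharsPerByte()  // 1.0f
    val tmpBytesSize = 1024                                       // bytes in tmpBytes
    val tmpCharsSize = (1024 * maxCharsPerByte).toInt() + 1       // 1025 chars in tmpChars
    val overheadBytes = tmpBytesSize + tmpCharsSize * 2           // each char is 2 bytes
    println(overheadBytes)  // 3074 – roughly 3 KiB, even for a one-character input
}
```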
I also know that the `StringBuilder` will dynamically resize depending on the input, and might not be the most efficient way to allocate space for the resulting `String`.
I think I could avoid some of the extra memory allocation if I had access to something like the chain buffer described in this answer. That would allow me to create a concatenated, windowed view of the incoming byte buffers. The character decoder could then consume the concatenated view directly, instead of needing the extra temporary buffer. However, I can't find anything in the standard library that offers that kind of functionality.
Is it possible to solve this problem without allocating extra memory beyond the incoming data and the resulting `String` itself? If not, what's the minimum amount of extra memory that's needed, and what's the approach that will achieve that minimum?