I am trying to optimize a simple decompression routine, and came across a weird performance quirk that I can't find much information on: manually implemented trivial byte buffers are 10-20% faster than the built-in byte buffers (heap & mapped) for trivial operations (read one byte, read n bytes, check for end of stream).
I tested 3 APIs:
- Methods on `ByteBuffer.wrap(byte[])`
- Raw `byte[]` accesses
- Methods on a trivial wrapper of byte accesses that (mostly) mirrors the `ByteBuffer` API
The trivial wrapper:
```java
import java.nio.ByteBuffer;

class TestBuf {
    private final byte[] ary;
    private int pos = 0;

    public TestBuf(ByteBuffer buffer) { // ctor #1: copies the buffer's remaining bytes
        ary = new byte[buffer.remaining()];
        buffer.get(ary);
    }

    public TestBuf(byte[] inAry) { // ctor #2: wraps the array directly, no copy
        ary = inAry;
    }

    public int readUByte() { return ary[pos++] & 0xFF; }

    public boolean hasRemaining() { return pos < ary.length; }

    public void get(byte[] out, int offset, int length) {
        System.arraycopy(ary, pos, out, offset, length);
        pos += length;
    }
}
```
The stripped-down core of my main loop is roughly a pattern of:
```java
while (buffer.hasRemaining()) {
    int op = buffer.readUByte();
    if (op == 1) {
        int size = buffer.readUByte();
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
```
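Since `TestBuf` mostly mirrors the `ByteBuffer` API, the ByteBuffer-accepting variant follows roughly the same loop shape, just through `ByteBuffer` methods (`ByteBuffer` has no unsigned-byte read, so it masks manually):

```java
// Roughly the same loop via the ByteBuffer API (the native-buffer combo below);
// ByteBuffer.get() returns a signed byte, so mask to get an unsigned value.
while (buffer.hasRemaining()) {
    int op = buffer.get() & 0xFF;
    if (op == 1) {
        int size = buffer.get() & 0xFF;
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
```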
I tested the following combos (call shapes are sketched after the list):

- `native-array`: Passing `byte[]` to a `byte[]`-accepting method (no copies)
- `native-testbuf`: Passing `byte[]` to a method that wrapped it in a `TestBuf` (no copies, ctor #2)
- `native-buffer`: Passing `ByteBuffer.wrap(byte[])` to a ByteBuffer-accepting method (no copies)
- `buffer-array`: Passing `ByteBuffer.wrap(byte[])` to a method that extracted the ByteBuffer to an array
- `buffer-testbuf`: Passing `ByteBuffer.wrap(byte[])` to a method that extracted the ByteBuffer to an array in `TestBuf` (ctor #1)
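In code, the five combos look roughly like this (the `decompress*` method names and `loadItem` helper are hypothetical, one decompressor per input type; the real code is in the gist linked below):

```java
byte[] data = loadItem();                               // one corpus item (assumed helper)

decompressArray(data);                                  // native-array
decompressTestBuf(new TestBuf(data));                   // native-testbuf (ctor #2, no copy)
decompressBuffer(ByteBuffer.wrap(data));                // native-buffer

ByteBuffer wrapped = ByteBuffer.wrap(data);
byte[] extracted = new byte[wrapped.remaining()];
wrapped.get(extracted);                                 // buffer-array: copy out...
decompressArray(extracted);                             // ...then run the byte[] version

decompressTestBuf(new TestBuf(ByteBuffer.wrap(data)));  // buffer-testbuf (ctor #1, copies)
```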
I used JMH (blackholing each outputArray), and tested Java 17 on OpenJDK and GraalVM with a decompression corpus of ~5 GiB preloaded into RAM, containing ~150,000 items ranging in size from 2 KiB to 15 MiB. Each corpus took ~10 sec to decompress, and the JMH runs had proper warmup and iterations. I did strip the tests down to the minimal necessary non-array code, but even when benchmarking the original code this came from, the difference is nearly the same percentage (i.e. I don't think there is much else beyond the buffer/array accesses controlling the performance of my original code).
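The harness shape was roughly the following (a minimal sketch, not the exact benchmark; the full version is in the gist linked below, and the `loadCorpusIntoRam`/`decompressBuffer` helpers here are assumed names):

```java
import java.nio.ByteBuffer;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.SingleShotTime) // matches the "ss" mode in the results below
public class DecompressBench {
    byte[][] corpus; // ~5 GiB of compressed items, preloaded before measurement

    @Setup
    public void setup() {
        corpus = loadCorpusIntoRam(); // assumed helper
    }

    @Benchmark
    public void nativeBuffer(Blackhole bh) {
        for (byte[] item : corpus) {
            // blackhole each output array so it isn't dead-code eliminated
            bh.consume(decompressBuffer(ByteBuffer.wrap(item)));
        }
    }
    // ... one @Benchmark per combo ...
}
```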
Across several computers the results were a bit jittery, but relatively consistent:
- GraalVM was usually slower than OpenJDK by about 10-15% (this surprised me), though the ordering and relative performance generally stayed the same as on OpenJDK
- `native-array` and `native-testbuf` were the fastest options, tying within the margin of error (and under 0.5%) thanks to the optimizer (9.3 s/corpus)
- `native-buffer` was always the slowest option, 17-22% slower than the fastest `native-array`/`native-testbuf` (11.4 s/corpus)
- `buffer-array` and `buffer-testbuf` were in the middle of the pack, within about 1% of each other but about 4-7% slower than `native-array`. However, despite the additional array copy they incurred, they were always significantly faster than `native-buffer`, by about 15-17% (9.7 s/corpus)
Two of these results surprised me the most:

- That a wrapped byte array being used via the ByteBuffer API (`native-buffer`) is so slow compared to a custom simple ByteBuffer-like wrapper (`native-testbuf`)
- That making a whole copy of the array (`buffer-*`) is still so much faster than using the `ByteBuffer.wrap` object (`native-buffer`)
I've tried looking around for information on what I might be doing wrong, but most of the performance questions are about native memory and `MappedByteBuffer`s, whereas I am using `HeapByteBuffer`s, as far as I can tell. Why are `HeapByteBuffer`s so slow compared to my re-implementation for trivial read access? Is there some way I can use `HeapByteBuffer`s more efficiently? Does that also apply to `MappedByteBuffer`s?
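For reference, the one heap-buffer workaround I'm aware of is reaching into the backing array that heap buffers expose via `hasArray()`/`array()`/`arrayOffset()`; a sketch of that extraction, which my `buffer-*` variants approximate with a full copy (the helper name is mine, not from the benchmark):

```java
import java.nio.ByteBuffer;

// Sketch: get a byte[] view of a heap buffer's contents. For a full-array
// wrap this can be zero-copy; otherwise it falls back to the bulk copy that
// the buffer-* variants above pay for.
static byte[] toArray(ByteBuffer buffer) {
    if (buffer.hasArray()
            && buffer.arrayOffset() == 0
            && buffer.position() == 0
            && buffer.remaining() == buffer.array().length) {
        return buffer.array(); // wrapped heap buffer covering the whole array
    }
    byte[] copy = new byte[buffer.remaining()];
    buffer.get(copy); // bulk copy (advances the buffer position)
    return copy;
}
```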
Update: I've posted the full benchmark, corpus generator, and algorithms at https://gist.github.com/byteit101/84a3ab8f292de404e122562c7008c133. Note that while trying to get the corpus generator to work, I discovered that my 24-bit number was causing a performance penalty, so I added a `buffer-buffer` target, where copying a buffer to a new buffer and using the new buffer is faster than using the original buffer after the 24-bit number.
One run on one of my machines with the generated corpus:
```
Benchmark                  Mode  Cnt  Score   Error  Units
SOBench.t1_native_array      ss   60  0.891 ± 0.018   s/op
SOBench.t2_buffer_testbuf    ss   60  0.899 ± 0.024   s/op
SOBench.t3_buffer_buffer     ss   60  0.935 ± 0.024   s/op
SOBench.t4_native_buffer     ss   60  1.099 ± 0.024   s/op
```
Some more recent observations: deleting unused code (see comments in the gist) makes ByteBuffer as fast as a native array, as do slight tweaks (changing bitmask conditionals to logical comparisons), so my current theory is that it's some inlining cache miss, with something offset-related too.
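To illustrate the kind of tweak I mean (a hypothetical condition; the exact ones are in the gist), swapping a bitmask test on an unsigned byte for an arithmetically equivalent comparison:

```java
// Hypothetical example: for op in 0..255 these two are equivalent,
// but swapping one form for the other shifted the benchmark results.
static boolean highBitMask(int op) {
    return (op & 0x80) != 0; // bitmask conditional
}

static boolean highBitCompare(int op) {
    return op >= 0x80;       // logical comparison
}
```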