
I am trying to optimize a simple decompression routine, and came across this weird performance quirk that I can't seem to find much information on: manually implemented trivial byte buffers are 10%-20% faster than the built-in byte buffers (heap & mapped) for trivial operations (read one byte, read n bytes, check for end of stream).

I tested 3 APIs:

  • Methods on ByteBuffer.wrap(byte[])
  • Raw byte[] accesses
  • Methods on a trivial wrapper around a byte[] that (mostly) mirrors the ByteBuffer API

The trivial wrapper:

import java.nio.ByteBuffer;

class TestBuf {
    private final byte[] ary;
    private int pos = 0;

    public TestBuf(ByteBuffer buffer) {  // ctor #1
        ary = new byte[buffer.remaining()];
        buffer.get(ary);
    }
    
    public TestBuf(byte[] inAry) { // ctor #2
        ary = inAry;
    }

    public int readUByte() { return ary[pos++] & 0xFF; }

    public boolean hasRemaining() { return pos < ary.length; }

    public void get(byte[] out, int offset, int length) {
        System.arraycopy(ary, pos, out, offset, length);
        pos += length;
    }
}
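The two constructors correspond to the two ways the wrapper gets fed in the combos below; a minimal usage sketch (loadItem() here is just an illustrative placeholder for getting one corpus item):

byte[] raw = loadItem();                              // hypothetical source of one corpus item
TestBuf direct = new TestBuf(raw);                    // ctor #2: wraps the array, no copy
TestBuf copied = new TestBuf(ByteBuffer.wrap(raw));   // ctor #1: copies the remaining bytes out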

The stripped-down core of my main loop is roughly this pattern:

while (buffer.hasRemaining()) {
    int op = buffer.readUByte();
    if (op == 1) {
        int size = buffer.readUByte();
        buffer.get(outputArray, outputPos, size);
        outputPos += size;
    } // ...
}
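The ByteBuffer-based variants run the same loop shape through the standard ByteBuffer API; roughly like this sketch (the exact benchmark code is in the gist linked in the update below):

// Same loop, expressed with the ByteBuffer API (the native-buffer case)
while (buffer.hasRemaining()) {
    int op = buffer.get() & 0xFF;                  // relative single-byte read, masked to unsigned
    if (op == 1) {
        int size = buffer.get() & 0xFF;
        buffer.get(outputArray, outputPos, size);  // relative bulk read into the output array
        outputPos += size;
    } // ...
}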

I tested the following combos:

  • native-array: Passing byte[] to a byte[]-accepting method (no copies)
  • native-testbuf: Passing byte[] to a method that wrapped it in a TestBuf (no copies, ctor #2)
  • native-buffer: Passing ByteBuffer.wrap(byte[]) to a ByteBuffer-accepting method (no copies)
  • buffer-array: Passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer to an array
  • buffer-testbuf: Passing ByteBuffer.wrap(byte[]) to a method that extracted the ByteBuffer to an array in TestBuf (ctor #1)

I used JMH (blackholing each outputArray), and tested Java 17 on OpenJDK and GraalVM with a decompression corpus of ~5GiB preloaded into RAM, containing ~150,000 items ranging in size from 2KiB to 15MiB. Each corpus took ~10sec to decompress, and the JMH runs had proper warmup and iterations. I did strip the tests down to the minimal necessary non-array code, but even benchmarking the original code this came from, the difference is nearly the same percentage (i.e. I don't think there is much else beyond the buffer/array accesses controlling the performance of my original code).
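For reference, the harness was shaped roughly like the sketch below; class, method, and helper names here are illustrative placeholders, not the exact benchmark (the real one is in the gist linked in the update below):

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)           // "ss" mode: one full corpus pass per invocation
public class BenchSketch {
    byte[][] corpus;                          // ~150,000 compressed items preloaded into RAM

    @Setup
    public void load() throws Exception {
        corpus = CorpusLoader.loadIntoRam();  // hypothetical corpus loader
    }

    @Benchmark
    public void nativeArray(Blackhole bh) {
        for (byte[] item : corpus) {
            bh.consume(Decompressor.decompress(item));  // blackhole each outputArray
        }
    }
}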

Across several computers the results were a bit jittery, but relatively consistent:

  • GraalVM was usually slower than OpenJDK by about 10-15% (this surprised me), though the order and relative performance generally stayed the same as on OpenJDK
  • native-array and native-testbuf were the fastest options, tying within the margin of error (and under 0.5%) thanks to the optimizer (9.3s/corpus)
  • native-buffer was always the slowest option. This was always 17-22% slower than the fastest native-array/native-testbuf (11.4s/corpus)
  • buffer-array and buffer-testbuf were in the middle of the pack within about 1% of each other, but about 4-7% slower than native-array. However, despite the additional array copy they incurred, they were always significantly faster than native-buffer by about 15-17%. (9.7s/corpus)

Two of these results surprised me the most:

  • That a wrapped byte array being used via the ByteBuffer API (native-buffer) is so slow compared to a custom simple ByteBuffer-like wrapper (native-testbuf)
  • That making a whole copy of an array (buffer-*) is still so much faster than using the ByteBuffer.wrap object (native-buffer)

I've tried looking around for information on what I might be doing wrong, but most of the performance questions are about native memory and MappedByteBuffers, whereas I am using HeapByteBuffers, as far as I can tell. Why are HeapByteBuffers so slow compared to my re-implementation for trivial read access? Is there some way I can use HeapByteBuffers more efficiently? Does that also apply to MappedByteBuffer?

Update: I've posted the full benchmark, corpus generator, and algorithms at https://gist.github.com/byteit101/84a3ab8f292de404e122562c7008c133 Note that while trying to get the corpus generator to work, I discovered that my 24-bit number was causing a performance penalty, so I added a buffer-buffer target, where copying a buffer to a new buffer and using the new buffer is faster than using the original buffer after the 24-bit number.

One run on one of my machines with the generated corpus:

Benchmark                  Mode  Cnt  Score   Error  Units
SOBench.t1_native_array      ss   60  0.891 ± 0.018   s/op
SOBench.t2_buffer_testbuf    ss   60  0.899 ± 0.024   s/op
SOBench.t3_buffer_buffer     ss   60  0.935 ± 0.024   s/op
SOBench.t4_native_buffer     ss   60  1.099 ± 0.024   s/op

Some more recent observations: deleting unused code (see comments in gist) makes ByteBuffer as fast as a native array, as do slight tweaks (changing bitmask conditionals to logical comparisons), so my current theory is that it's some inlining or cache-miss effect, with something offset-related involved too.

byteit101
  • Rather than describe the setup of your benchmark in words, can't you just provide the benchmark? It sounds like you've set it up correctly but it's very easy to get wrong. I don't see how anyone could reasonably help with this without reimplementing what you've already done. – Michael Dec 01 '22 at 15:09
  • Alas the corpus is proprietary. I will see if I can create some generative corpus or other – byteit101 Dec 01 '22 at 15:24
  • Well if the specific input matters and you can't provide it then the best you'll get from someone here would be baseless speculation. Let's hope the specific input doesn't matter. – Michael Dec 01 '22 at 15:27
  • You didn’t show the `ByteBuffer` based loop. Your API is different, so the code must be different. And how do you process ~5GiB with a byte array that can have at most 2GiB? – Holger Dec 02 '22 at 08:15
  • @Holger The corpus is ~5G, but each member of the corpus is processed separately, thus only a few megabytes are required for each. – byteit101 Dec 05 '22 at 08:09
  • @Michael I've attached a link to a gist with all 5 files required to run this, including a corpus generator – byteit101 Dec 05 '22 at 08:10
  • The JIT compiler likes smaller methods. So, it’s probably not surprising that removing code or eliminating redundancy improves performance. For real life cases (where the code is not unused), you probably have to split behemoth methods into smaller methods to improve performance. – Holger Dec 12 '22 at 10:11
  • @Holger Indeed I tested that with my original code (~4x the size) with each case or loop as its own method, and those performed even worse – byteit101 Dec 13 '22 at 17:52

1 Answer


I think there is a regression with Java 17. I'm using a lib which processes a String; several new copies are created along the way through String.split or String.getBytes. So I tried out an alternative implementation with ByteBuffer.
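Purely for illustration (the actual lib code differs), a hypothetical sketch of that kind of substitution, assuming ASCII input split on a delimiter byte: one up-front byte[] conversion, then cursor-style reads over a ByteBuffer instead of allocating a new String per field.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: single getBytes up front, then cursor reads over a ByteBuffer
// instead of per-field String allocations via String.split.
static void scanFields(String input) {
    ByteBuffer buf = ByteBuffer.wrap(input.getBytes(StandardCharsets.US_ASCII));
    int start = buf.position();
    while (buf.hasRemaining()) {
        if (buf.get() == ',') {
            handleField(buf, start, buf.position() - 1);  // hypothetical per-field handler
            start = buf.position();
        }
    }
    handleField(buf, start, buf.position());              // trailing field
}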

With Java 11 this solution is roughly 30% faster than the original String based version.

time: 129 vs 180 ns/op
gc.alloc.rate: 2931 vs 3861 MB/sec
gc.count: 300 vs 323
gc.time: 172 vs 178 ms

With Java 17 it changed. The ByteBuffer version deteriorated, the String version improved.

time: 143 vs 146 ns/op
gc.alloc.rate: 2889 vs 4781 MB/sec
gc.count: 426 vs 586
gc.time: 240 vs 305 ms

Even gc.count and gc.time increased.

Stef
  • This is a good find, but using my original benchmark (removing var & other java 17 features) I find that jdk 1.8 is 15% faster for arrays, jdk 11 is 20% faster for arrays, and jdk 17 is 30% faster for arrays. However, the total time also decreased on all three options – byteit101 Jun 10 '23 at 18:45