
Here's a result I can't wrap my head around, despite extensive reading of the JDK source and examination of the intrinsic routines.

I'm testing clearing a ByteBuffer allocated with allocateDirect, using ByteBuffer.putLong(int index, long value). Based on the JDK code, this results in a single 8-byte write if the buffer is in native byte order, or a byte swap followed by the same write if it isn't.

So I'd expect native byte order (little-endian for me) to be at least as fast as non-native. As it turns out, however, non-native order is ~2x faster.
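To make the byte-order semantics concrete, here's a minimal standalone sketch (not part of the benchmark; the class name is mine) showing how the same putLong call lays bytes out differently depending on the buffer's order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OrderDemo {
    public static void main(String[] args) {
        ByteBuffer be = ByteBuffer.allocateDirect(8).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer le = ByteBuffer.allocateDirect(8).order(ByteOrder.LITTLE_ENDIAN);
        be.putLong(0, 0x0102030405060708L);
        le.putLong(0, 0x0102030405060708L);
        // Big endian: most significant byte at the lowest address.
        System.out.println(be.get(0)); // 1
        // Little endian: least significant byte at the lowest address.
        System.out.println(le.get(0)); // 8
    }
}
```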

Here's my benchmark, using Caliper 0.5x:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class ByteBufferBench extends SimpleBenchmark {

    private static final int SIZE = 2048;
    private static final int LONG_BYTES = 8; // size of a long in bytes

    enum Endian {
        DEFAULT,
        SMALL,
        BIG
    }

    @Param Endian endian;

    private ByteBuffer bufferMember; 

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        bufferMember = ByteBuffer.allocateDirect(SIZE);
        bufferMember.order(endian == Endian.DEFAULT ? bufferMember.order() :
            (endian == Endian.SMALL ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN));
    }

    public int timeClearLong(int reps) {
        ByteBuffer buffer = bufferMember;
        while (reps-- > 0) {
            for (int i = 0; i < SIZE; i += LONG_BYTES) {
                buffer.putLong(i, reps);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Runner.main(ByteBufferBench.class, args);
    }

}

The results are:

benchmark       type  endian     ns linear runtime
ClearLong     DIRECT DEFAULT   64.8 =
ClearLong     DIRECT   SMALL  118.6 ==
ClearLong     DIRECT     BIG   64.8 =

That's consistent. If I swap putLong for putFloat, it's about 4x faster for native order. If you look at how putLong works, it's doing strictly more work in the non-native case:

private ByteBuffer putLong(long a, long x) {
    if (unaligned) {
        long y = (x);
        unsafe.putLong(a, (nativeByteOrder ? y : Bits.swap(y)));
    } else {
        Bits.putLong(a, x, bigEndian);
    }
    return this;
}

Note that unaligned is true in either case. The only difference between native and non-native byte order is the Bits.swap call, which should favor the native case (little-endian), since the swap is skipped there.
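For reference, Bits.swap on a long appears to delegate to the public Long.reverseBytes (an assumption based on my reading of the JDK source), which HotSpot intrinsifies into a single byte-swap instruction on x86. A quick sketch of what the swap does:

```java
public class SwapDemo {
    public static void main(String[] args) {
        long x = 0x0102030405060708L;
        // Long.reverseBytes reverses the byte order of the 64-bit value.
        long swapped = Long.reverseBytes(x);
        System.out.printf("%016x%n", swapped); // 0807060504030201
        // Swapping twice restores the original value.
        System.out.println(Long.reverseBytes(swapped) == x); // true
    }
}
```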

BeeOnRope
    You're using only a part (about 1/8) of the buffer as `putLong` expects the offset in bytes. I can't see why the access should be unaligned (when you fix the byte vs. long offset thingy). My [results](https://microbenchmarks.appspot.com/runs/0bd9f0ea-96d4-4cfd-97ce-105a3ccc9a1d) (created via caliper 1.0 beta) differ. – maaartinus Oct 09 '13 at 07:58
  • @Kayaman - I'm on a Xeon W3580, but I don't expect it differs across x86 architectures. – BeeOnRope Oct 09 '13 at 08:11
  • @maaartinus - You are right, good catch. I've fixed the benchmark, and the anomaly remains (see updated numbers and benchmark code in the post). – BeeOnRope Oct 09 '13 at 08:19
  • @maaartinus - my new conclusion is that `DirectByteBuffer` is better for almost everything, often by nearly an order of magnitude – BeeOnRope Oct 09 '13 at 08:26
  • Just yesterday I encountered this same phenomenon. Was going to pose the question to SO, but you beat me to the punch (and did the hard work for me). :-) I'm using Java Microbenchmark Harness and JDK8. – Andrew Bissell Oct 09 '13 at 17:26
  • For an operation involving unpacking bytes from a long, I've noticed that `DirectByteBuffer` with non-native order is faster even than `Unsafe` (which makes sense given that `Unsafe` uses native order). – Andrew Bissell Oct 09 '13 at 17:46
  • Reviewed my own benchmark where I thought I observed similar behavior and found that I was mixing orders. I was converting bytes->(little-endian)->long->(big-endian)->bytes, which ran faster than bytes->(little-endian)->long->(little-endian)->bytes, but yielded meaningless results. – Andrew Bissell Oct 10 '13 at 05:04

2 Answers


To summarize the discussion from the mechanical sympathy mailing list:

1. The anomaly described by the OP was not reproducible on my setup (JDK7u40/Ubuntu 13.04/i7): performance was consistent for both heap and direct buffers in all cases, with direct buffers offering a massive performance advantage:

BYTE_ARRAY DEFAULT 211.1 ==============================
BYTE_ARRAY   SMALL 199.8 ============================
BYTE_ARRAY     BIG 210.5 =============================
DIRECT     DEFAULT  33.8 ====
DIRECT       SMALL  33.5 ====
DIRECT         BIG  33.7 ====

The Bits.swap(y) method gets intrinsified into a single instruction, so it can't/shouldn't really account for much of a difference or overhead.

2. The above result (i.e., contradicting the OP's experience) was independently confirmed by a naive hand-rolled benchmark and by a JMH benchmark written by another participant.

This leads me to believe you are either experiencing some local issue or some sort of benchmarking-framework issue. It would be valuable if others could run the experiment and see whether they can reproduce your result.

Nitsan Wakart

The default is big endian, even on little endian systems. Can you try ByteOrder.nativeOrder()? That should be faster for you.
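A minimal sketch of that suggestion (class name is mine): newly allocated ByteBuffers default to BIG_ENDIAN regardless of the platform, and ByteOrder.nativeOrder() selects the CPU's own order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NativeOrderDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(2048);
        // The default order is always big endian, per the ByteBuffer spec.
        System.out.println(buf.order());            // BIG_ENDIAN
        // Switch to the platform's native order (LITTLE_ENDIAN on x86).
        buf.order(ByteOrder.nativeOrder());
        System.out.println(buf.order());
    }
}
```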

Direct ByteBuffers are faster for IO, as heap buffers have to be copied to/from a direct buffer.
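The distinction is easy to see in code (a small illustrative sketch, class name mine): heap buffers are backed by a Java byte[], while direct buffers live in off-heap memory the OS can access without an extra copy.

```java
import java.nio.ByteBuffer;

public class HeapVsDirect {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(2048);       // backed by byte[]
        ByteBuffer direct = ByteBuffer.allocateDirect(2048); // off-heap memory
        System.out.println(heap.isDirect());   // false
        System.out.println(heap.hasArray());   // true: accessible backing array
        System.out.println(direct.isDirect()); // true
        System.out.println(direct.hasArray()); // false: no backing byte[]
    }
}
```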

Btw, you might like to compare this with using Unsafe directly, since ByteBuffer does a bounds check and Unsafe doesn't, to see how much difference it makes.

Peter Lawrey
I understand. My point was that big endian (the default for `ByteBuffer`) was faster on my little endian system (as it is for, in practice, 99% of SO posters, and me). This makes no sense based on my reading of the code. – BeeOnRope Oct 09 '13 at 08:40
  • Agreed. Need to try to reproduce when I get home. Some of the operations were the same. – Peter Lawrey Oct 09 '13 at 17:21