Here's a result I can't wrap my head around, despite extensive reading of the JDK source and examination of the intrinsic routines.
I'm testing clearing out a ByteBuffer, allocated with allocateDirect, using ByteBuffer.putLong(int index, long value). Based on the JDK code, this results in a single 8-byte write if the buffer is in "native byte order", or a byte swap followed by the same write if it isn't.
So I'd expect native byte order (little-endian for me) to be at least as fast as non-native. As it turns out, however, non-native order is ~2x faster.
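For reference, the platform order can be checked with ByteOrder.nativeOrder(), and a newly created ByteBuffer (direct or heap) always starts out BIG_ENDIAN, which is why the DEFAULT and BIG cases come out identical in the results below. A quick illustrative check (the class name OrderCheck is just for this sketch):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OrderCheck {
    public static void main(String[] args) {
        // Platform byte order: LITTLE_ENDIAN on x86.
        System.out.println("native:  " + ByteOrder.nativeOrder());

        // Newly created buffers, direct or heap, start out BIG_ENDIAN.
        ByteBuffer buf = ByteBuffer.allocateDirect(2048);
        System.out.println("default: " + buf.order());

        // Switch to the platform's order explicitly.
        buf.order(ByteOrder.nativeOrder());
        System.out.println("now:     " + buf.order());
    }
}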
Here's my benchmark in Caliper 0.5x:
...
public class ByteBufferBench extends SimpleBenchmark {

    private static final int SIZE = 2048;
    private static final int LONG_BYTES = Long.SIZE / Byte.SIZE; // 8

    enum Endian {
        DEFAULT,
        SMALL,
        BIG
    }

    @Param Endian endian;

    private ByteBuffer bufferMember;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        bufferMember = ByteBuffer.allocateDirect(SIZE);
        // DEFAULT keeps the buffer's initial order; SMALL/BIG set it explicitly.
        bufferMember.order(endian == Endian.DEFAULT ? bufferMember.order() :
                (endian == Endian.SMALL ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN));
    }

    public int timeClearLong(int reps) {
        ByteBuffer buffer = bufferMember;
        while (reps-- > 0) {
            for (int i = 0; i < SIZE / LONG_BYTES; i += LONG_BYTES) {
                buffer.putLong(i, reps);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Runner.main(ByteBufferBench.class, args);
    }
}
The results are:
benchmark  type    endian      ns linear runtime
ClearLong  DIRECT  DEFAULT   64.8 =
ClearLong  DIRECT  SMALL    118.6 ==
ClearLong  DIRECT  BIG       64.8 =
That result is consistent. If I swap putLong for putFloat, it's about 4x faster for native order. If you look at how putLong works in DirectByteBuffer, it's doing strictly more work in the non-native case:
private ByteBuffer putLong(long a, long x) {
    if (unaligned) {
        long y = (x);
        unsafe.putLong(a, (nativeByteOrder ? y : Bits.swap(y)));
    } else {
        Bits.putLong(a, x, bigEndian);
    }
    return this;
}
Note that unaligned is true in either case, so both orders go through the unsafe.putLong path. The only difference between native and non-native byte order is the extra Bits.swap call, which should work in favor of the native case (little-endian here), not against it.
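As far as I can tell, Bits.swap(long) just delegates to Long.reverseBytes, which I believe HotSpot intrinsifies on x86, so the swap alone shouldn't come anywhere near the cost of a full putLong. To sanity-check that, and to show the putFloat comparison mentioned above, methods along these lines could be added to the benchmark class (timeClearFloat and timeSwapOnly are hypothetical names, not part of the original benchmark):

// Mirrors timeClearLong, but with putFloat instead of putLong.
public int timeClearFloat(int reps) {
    ByteBuffer buffer = bufferMember;
    while (reps-- > 0) {
        for (int i = 0; i < SIZE / LONG_BYTES; i += LONG_BYTES) {
            buffer.putFloat(i, reps);
        }
    }
    return 0;
}

// Tries to isolate the cost of the byte swap itself: same number of
// iterations as timeClearLong, but only Long.reverseBytes, no buffer write.
public int timeSwapOnly(int reps) {
    long sink = 0;
    while (reps-- > 0) {
        for (int i = 0; i < SIZE / LONG_BYTES; i += LONG_BYTES) {
            sink ^= Long.reverseBytes(reps + i);
        }
    }
    return (int) sink; // returned so the JIT can't discard the loop
}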