I have observed odd GC behavior when testing the Java 19 jdk.incubator.vector and java.lang.foreign APIs. Using JMH to avoid OSR compilation artifacts (please let me know if I'm doing this part wrong), I have the following two nearly identical benchmarks (an admittedly contrived example):
package org.sample;

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
import java.nio.ByteOrder;
import java.lang.foreign.*;
import java.lang.foreign.ValueLayout;
import jdk.incubator.vector.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class MyBenchmark {
    static {
        // If I remove this line, behavior does not occur
        System.setProperty("jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", "0");
    }

    final static long MEM_SIZE = Double.BYTES * 4; // Size of SPECIES_256 vectors
    final static MemorySegment segment = MemorySegment.ofAddress(MemoryAddress.NULL, Long.MAX_VALUE, MemorySession.global());
    final static ByteOrder byteOrder = ByteOrder.nativeOrder();
    final long a = MemorySegment.allocateNative(MEM_SIZE, MEM_SIZE, MemorySession.global()).address().toRawLongValue();
    final long b = MemorySegment.allocateNative(MEM_SIZE, MEM_SIZE, MemorySession.global()).address().toRawLongValue();

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OperationsPerInvocation(1)
    public double hasNoGCActivity() {
        ByteVector.fromMemorySegment(ByteVector.SPECIES_256, segment, b, byteOrder)
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles()
            .intoMemorySegment(segment, a, byteOrder);
        return MemoryAddress.NULL.get(ValueLayout.JAVA_DOUBLE, a);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OperationsPerInvocation(1)
    public double hasSomeGCActivity() {
        ByteVector.fromMemorySegment(ByteVector.SPECIES_256, segment, b, byteOrder)
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .intoMemorySegment(segment, a, byteOrder);
        return MemoryAddress.NULL.get(ValueLayout.JAVA_DOUBLE, a);
    }
}
The benchmark requires the following in pom.xml:
<compilerArgs>
<arg>--enable-preview</arg>
<arg>--add-modules</arg>
<arg>jdk.incubator.vector</arg>
</compilerArgs>
along with <javac.target>19</javac.target>. It can be run with:
java --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector -verbose:gc -jar ./target/benchmarks.jar
The two benchmarks are practically identical, except that the second has one more method call in the chain. Byte reinterpretation should be a no-op at the assembly level, so I would expect the two to perform identically no matter how many calls I chain (I alternate between Byte and Double because the library is smart enough to do nothing when a vector is cast to the type it already has). They do, in fact, perform identically under normal circumstances, but if I disable bounds checking for memory accesses through the {from, into}MemorySegment methods with the system property at the top, the second benchmark becomes much slower (by about an order of magnitude).
The output from -verbose:gc indicates that there is significant, repeated GC activity in the second method after one or two warmup iterations, which contributes to the slowdown. Looking at the generated assembly (not of the OSR compilation), it seems that while the redundant reinterpretations are optimized out in both cases, the second one contains instructions for setting up a Java double[] somewhere in memory, at an address that comes from outside of the method's native code but presumably points to a new allocation.
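To put a number on the allocation rather than inferring it from -verbose:gc, per-thread allocation counters can be read around a loop. The following is only a minimal sketch under assumptions: it relies on the HotSpot-specific com.sun.management.ThreadMXBean (JDK 14+ for getCurrentThreadAllocatedBytes), and AllocationProbe, bytesPerCall, sink and sum are hypothetical names that are not part of the benchmark. It uses a plain double[] instead of the Vector API so that it runs without any extra flags.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the benchmark above.
final class AllocationProbe {
    // Assumption: a HotSpot-based JDK 14+, where the platform ThreadMXBean can be
    // cast to the com.sun.management extension that tracks allocated bytes.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Written by the escaping case so the JIT cannot remove that allocation.
    static double[] sink;
    // Keeps the result of the non-escaping case observable.
    static double sum;

    // Average number of bytes the current thread allocated per call of body.
    static double bytesPerCall(Runnable body, int calls) {
        long before = THREADS.getCurrentThreadAllocatedBytes();
        for (int i = 0; i < calls; i++) {
            body.run();
        }
        long after = THREADS.getCurrentThreadAllocatedBytes();
        return (double) (after - before) / calls;
    }

    public static void main(String[] args) {
        // The array is stored in a static field, so it escapes and must be heap-allocated.
        Runnable escaping = () -> sink = new double[4];
        // The array never leaves the lambda, so it is a candidate for allocation elimination.
        Runnable local = () -> {
            double[] tmp = new double[4];
            tmp[0] = 1.0;
            sum += tmp[0];
        };
        // Repeat the measurement so that later rounds run JIT-compiled code.
        for (int round = 0; round < 5; round++) {
            System.out.printf("round %d: escaping=%.1f B/call, local=%.1f B/call%n",
                    round, bytesPerCall(escaping, 1_000_000), bytesPerCall(local, 1_000_000));
        }
    }
}
```

Within JMH itself, the built-in GC profiler (-prof gc) reports a normalized allocation rate per operation (gc.alloc.rate.norm), which should make the difference between the two benchmark methods directly visible without a hand-written probe like this one.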
I would like to know if anyone has any ideas about what is going on here. Is there some optimization that is inhibited by the extra invokevirtual bytecode, so that some allocations don't get elided? Thank you.
Edit: reading the explanation of allocation removal in the C2 compiler here, it seems to me that this might be related, perhaps because the extra call complicates the analysis past some arbitrary limit. However, I have played around with the optimizer settings in a debug build of JDK 19 (e.g. increasing the time allowed for escape analysis with -XX:EscapeAnalysisTimeout=60) with no change in program behavior. In fact, judging by the output of -XX:+PrintEscapeAnalysis, it doesn't look like the benchmark methods themselves are ever subjected to escape analysis.