Unexpected Garbage Collector activity when disabling vector access bounds checking

Question

I have observed odd GC behavior when testing the Java 19 jdk.incubator.vector and java.lang.foreign APIs. Using JMH to avoid OSR compilation artifacts (please let me know if I'm doing this part wrong), I have the following two nearly identical benchmarks (in an admittedly contrived example):

package org.sample;

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

import java.nio.ByteOrder;
import java.lang.foreign.*;
import java.lang.foreign.ValueLayout;

import jdk.incubator.vector.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class MyBenchmark {
    static {
        // If I remove this line, behavior does not occur
        System.setProperty("jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", "0");
    }

    final static long MEM_SIZE = Double.BYTES * 4;  // Size of SPECIES_256 vectors
    final static MemorySegment segment = MemorySegment.ofAddress(MemoryAddress.NULL, Long.MAX_VALUE, MemorySession.global());
    final static ByteOrder byteOrder = ByteOrder.nativeOrder();
    final long a = MemorySegment.allocateNative(MEM_SIZE, MEM_SIZE, MemorySession.global()).address().toRawLongValue();
    final long b = MemorySegment.allocateNative(MEM_SIZE, MEM_SIZE, MemorySession.global()).address().toRawLongValue();

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OperationsPerInvocation(1)
    public double hasNoGCActivity() {
        ByteVector.fromMemorySegment(ByteVector.SPECIES_256, segment, b, byteOrder)
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles()
            .intoMemorySegment(segment, a, byteOrder);
        return MemoryAddress.NULL.get(ValueLayout.JAVA_DOUBLE, a);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OperationsPerInvocation(1)
    public double hasSomeGCActivity() {
        ByteVector.fromMemorySegment(ByteVector.SPECIES_256, segment, b, byteOrder)
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .reinterpretAsDoubles().reinterpretAsBytes().reinterpretAsDoubles().reinterpretAsBytes()
            .intoMemorySegment(segment, a, byteOrder);
        return MemoryAddress.NULL.get(ValueLayout.JAVA_DOUBLE, a);
    }
}

which requires

<compilerArgs>
    <arg>--enable-preview</arg>
    <arg>--add-modules</arg>
    <arg>jdk.incubator.vector</arg>
</compilerArgs>

and <javac.target>19</javac.target> in pom.xml. It can be run with:

java --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector -verbose:gc -jar ./target/benchmarks.jar

The two benchmarks are identical except that the second has one more method call in the chain. Byte reinterpretation should be a no-op at the assembly level, so I would expect the two to perform identically no matter how many calls I chain (I alternate between Byte and Double because the library is smart enough to do nothing when a vector is cast to the type it already has). They do, in fact, perform identically under normal circumstances, but if I disable bounds checking for memory accesses through the {from, into}MemorySegment methods with the system property at the top, the second benchmark becomes about an order of magnitude slower.
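To make concrete what I mean by "reinterpretation is a no-op", here is a minimal sketch using plain java.nio rather than the Vector API. The class and method names are hypothetical and not part of the benchmark; the sketch only assumes that the input length is a multiple of 8 and that none of the bit patterns is a NaN.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the benchmark above.
final class ReinterpretRoundTrip {
    // Views the bytes as doubles and then writes those doubles back as bytes.
    // Only bit patterns are copied, so the output should equal the input.
    static byte[] roundTrip(byte[] in) {
        double[] lanes = new double[in.length / Double.BYTES];
        ByteBuffer.wrap(in).order(ByteOrder.nativeOrder()).asDoubleBuffer().get(lanes);
        byte[] out = new byte[in.length];
        ByteBuffer.wrap(out).order(ByteOrder.nativeOrder()).asDoubleBuffer().put(lanes);
        return out;
    }

    public static void main(String[] args) {
        byte[] in = new byte[32]; // same size as one SPECIES_256 vector
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) i;
        }
        System.out.println(Arrays.equals(in, roundTrip(in)));
    }
}
```

The vector chain in the benchmarks is the same idea applied to a 256-bit register, which is why I expected the number of reinterpret calls not to matter.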

The output from -verbose:gc indicates that there is significant, repeated GC activity in the second method after one or two warmup iterations, which contributes to the slowdown. Looking at the generated assembly (not of the OSR compilation), it seems that in both cases the redundant reinterpretations are optimized out, but in the second one there are instructions that set up a Java double[] somewhere in memory, at an address that comes from outside of the native method and presumably points to a new allocation.
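One way to put a number on the allocation, independent of the -verbose:gc log, is to read the per-thread allocation counter that HotSpot exposes. The following is a minimal sketch under that assumption (it relies on the HotSpot-specific com.sun.management.ThreadMXBean, JDK 14 or later); the class name, the sink field, and the stand-in workload are hypothetical and would need to be replaced by a call to the benchmark method.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of the benchmark above.
final class AllocationProbe {
    // Publishing the result to a volatile field keeps the allocation from being elided.
    static volatile Object sink;

    // Returns the number of bytes the calling thread allocated while running body reps times.
    static long allocatedBytes(Runnable body, int reps) {
        com.sun.management.ThreadMXBean threads =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long before = threads.getCurrentThreadAllocatedBytes();
        for (int i = 0; i < reps; i++) {
            body.run();
        }
        return threads.getCurrentThreadAllocatedBytes() - before;
    }

    public static void main(String[] args) {
        int reps = 1_000_000;
        // Stand-in workload: one double[4] per call, roughly what the assembly suggests.
        long bytes = allocatedBytes(() -> sink = new double[4], reps);
        System.out.println("bytes allocated per call: " + (double) bytes / reps);
    }
}
```

If the slow benchmark really materializes a double[4] on every call, a probe like this should report a roughly constant number of bytes per call once the method is compiled, while the fast one should report close to zero.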

I would like to know if anyone has any ideas on what is going on here... is there some optimization that is inhibited by the extra invokevirtual bytecode, which means that some allocations don't get elided? Thank you.

Edit: reading the explanation of allocation removal in the C2 compiler here, it seems to me that this might be related, perhaps by complicating the analysis past some arbitrary limit. However, I have played around with the optimizer settings in a debug build of JDK19 (e.g. incrementing the allowed time for escape analysis [-XX:EscapeAnalysisTimeout=60]) with no change in program behavior. In fact, using -XX:+PrintEscapeAnalysis it doesn't look like the benchmark methods themselves are ever subjected to escape analysis.
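Since JMH runs the benchmark in a forked JVM, one thing worth ruling out is that the flags never reached the process that actually runs the code. A minimal sketch of such a check, using the HotSpot diagnostic MXBean, is shown here; the class name is hypothetical, and the list of flags is only my guess at limits that might be relevant, not a confirmed cause. Flags that exist only in debug builds are reported as not available on a product build.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of the benchmark above.
final class VmFlagDump {
    // Returns the current value of a VM flag, or a marker if this VM build does not know it.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean hotspot =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            return hotspot.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "<not available>";
        }
    }

    public static void main(String[] args) {
        // Assumed candidates; which limit (if any) is involved is exactly what is unknown.
        String[] names = {"DoEscapeAnalysis", "EliminateAllocations", "MaxInlineLevel", "MaxNodeLimit"};
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

Calling flagValue from a @Setup method in the benchmark class would show the values in effect inside the fork rather than in the launcher JVM.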

All compilers have limitations, especially JIT compilers that need to be very fast and moderate in resource consumption. Would you really expect C2 compiler to deal with infinitely complex expressions? Probably not. So there must be a limit somewhere, and you empirically found the boundary. Why do you call it "unexpected" then? — apangin, Feb 04 '23 at 22:27
Perhaps "unexpected" is not the right word. However, as this limit is arbitrary, I would like to understand how to modify it if necessary for my application. None of the flags I've found so far in the C2 compiler have changed the behavior. — Francisco O., Feb 06 '23 at 19:01

0 Answers