
I am trying to benchmark the effects false sharing has on the performance of a program.

In this example: https://github.com/lexburner/JMH-samples/blob/master/src/main/java/org/openjdk/jmh/samples/JMHSample_22_FalseSharing.java , cache-line padding yields an order-of-magnitude performance improvement.
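For reference, the padding trick in that sample can be sketched without JMH as follows (a minimal illustration with made-up class names, not the sample's exact code, which pads on both sides of each field): the unpadded state keeps the reader's and writer's fields adjacent, while the padded state inserts filler longs so the two hot fields cannot share a 64-byte cache line.

```java
// Minimal sketch of the field-padding idea from JMHSample_22_FalseSharing.
public class PaddingSketch {

    static class Unpadded {
        volatile long readOnly;   // hammered by the reader thread
        volatile long writeOnly;  // hammered by the writer thread; likely
                                  // sits on the same cache line as readOnly
    }

    static class Padded {
        volatile long readOnly;
        // 8 x 8 = 64 bytes of filler pushes writeOnly onto another line
        long p01, p02, p03, p04, p05, p06, p07, p08;
        volatile long writeOnly;
    }

    public static void main(String[] args) throws InterruptedException {
        Padded s = new Padded();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) s.writeOnly++;
        });
        Thread reader = new Thread(() -> {
            long sink = 0;
            for (int i = 0; i < 1_000_000; i++) sink += s.readOnly;
        });
        writer.start(); reader.start();
        writer.join(); reader.join();
        // single writer thread, so no lost updates
        System.out.println(s.writeOnly);
    }
}
```

Note that the JVM may reorder or elide object fields, so hand-written filler fields are a heuristic; the real sample relies on the same trick.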

However, when I use Project Panama's Foreign Memory Access API, the cache-line padding actually makes performance slightly worse. Do MemorySegments implicitly use padding? What else could be causing this behaviour?

I have already tried running the benchmarks on different hardware and turning off hyper-threading, with the same outcome.
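One thing worth checking (my suggestion, not from the benchmark above): whether the two baseline segments even land on the same cache line. Two separate small native allocations are not guaranteed to be adjacent, and if they already sit on different lines, the baseline never exhibits false sharing to begin with. The raw addresses would come from the incubator API (e.g. readSegment.address().toRawLongValue() on JDK 17); the co-location check itself is plain arithmetic:

```java
public class CacheLineCheck {
    static final long CACHE_LINE = 64; // bytes; typical for x86-64

    // True if both raw addresses fall inside the same cache line.
    static boolean sameCacheLine(long a, long b) {
        return (a / CACHE_LINE) == (b / CACHE_LINE);
    }

    public static void main(String[] args) {
        // Addresses below are made up; in the benchmark you would pass
        // the segments' raw native addresses instead.
        System.out.println(sameCacheLine(0x7f00_0000_0000L, 0x7f00_0000_0008L)); // true
        System.out.println(sameCacheLine(0x7f00_0000_0000L, 0x7f00_0000_0040L)); // false
    }
}
```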

Benchmark details:

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(5)
public class JMH_FalseSharing {

    @State(Scope.Group)
    public static class StateBaseline {
        @TearDown(Level.Trial)
        public void tearDown(){
            rs.close();
        }

        ResourceScope rs = ResourceScope.newSharedScope();
        static final VarHandle VH;
        final MemorySegment readSegment = MemorySegment.allocateNative(MemoryLayouts.JAVA_LONG, rs);
        final MemorySegment writeSegment = MemorySegment.allocateNative(MemoryLayouts.JAVA_LONG, rs);

        static{
            VH = MemoryLayouts.JAVA_LONG.varHandle(long.class);
        }
    }
    
    @State(Scope.Group)
    public static class StatePadded {
        @TearDown(Level.Trial)
        public void tearDown(){
            rs.close();
        }
    
        ResourceScope rs = ResourceScope.newSharedScope();
        static final VarHandle VH;
        private static final GroupLayout gl = MemoryLayout.structLayout(
            MemoryLayout.paddingLayout(448L),        // 448 bits = 56 bytes
            MemoryLayouts.JAVA_LONG.withName("val"), // 64 bits
            MemoryLayout.paddingLayout(448L)         // 448 bits = 56 bytes
            );
            
        final MemorySegment readSegment  = MemorySegment.allocateNative(gl, rs);
        final MemorySegment writeSegment = MemorySegment.allocateNative(gl, rs);

        static{
            VH = gl.varHandle(long.class, MemoryLayout.PathElement.groupElement("val"));
        }
    }
    
    @Group("baseline")
    @Benchmark
    public void baselineWrite(StateBaseline baselineState){
        StateBaseline.VH.setRelease(baselineState.writeSegment, (long)StateBaseline.VH.getAcquire(baselineState.writeSegment) + 1);
    }

    @Group("baseline")
    @Benchmark
    public void baselineRead(Blackhole blackhole, StateBaseline baselineState){
        blackhole.consume((long)StateBaseline.VH.getAcquire(baselineState.readSegment));
    }
    
    @Group("padded")
    @Benchmark
    public void paddedWrite(StatePadded paddedState){
        StatePadded.VH.setRelease(paddedState.writeSegment, (long)StatePadded.VH.getAcquire(paddedState.writeSegment) + 1);
    }

    @Group("padded")
    @Benchmark
    public void paddedRead(Blackhole blackhole, StatePadded paddedState){
        blackhole.consume((long)StatePadded.VH.getAcquire(paddedState.readSegment));
    }
}
sqlearner
  • The example you mentioned (JMHSample_22_FalseSharing) performs only plain reads and writes (no membars involved). Your benchmark: 1. uses VarHandle-based operations instead of direct field access (VH operations are typically optimized, but I'd still expect some additional perf penalty) 2. introduces membars with getAcquire/setRelease. So the tests are very different from each other. For a more detailed analysis you could try the JMH "perfasm" profiler. Or even better https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/tuning-recipes/false-sharing.html – AnatolyG Aug 25 '21 at 10:20
  • Intel VTune can easily be configured to analyze a Java process – AnatolyG Aug 25 '21 at 10:21
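As the first comment points out, the two benchmarks use different access modes: the JMH sample does plain loads and stores, while the benchmark in the question does acquire loads and release stores through a VarHandle. A minimal plain-Java sketch of that difference (class and field names are made up for illustration):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Contrasts a plain field access (as in JMHSample_22_FalseSharing) with
// the getAcquire/setRelease VarHandle accesses used in the question.
class Counter {
    long plain;
    long fenced;
    static final VarHandle FENCED;
    static {
        try {
            FENCED = MethodHandles.lookup()
                    .findVarHandle(Counter.class, "fenced", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}

public class AccessModes {
    public static void main(String[] args) {
        Counter c = new Counter();
        c.plain++; // plain load + store, freely optimizable by the JIT
        // acquire load + release store, which constrain reordering:
        Counter.FENCED.setRelease(c, (long) Counter.FENCED.getAcquire(c) + 1);
        System.out.println(c.plain + " " + (long) Counter.FENCED.getAcquire(c));
    }
}
```

The ordering constraints alone can change what the JIT is allowed to hoist out of the benchmark loop, so the two benchmarks are not measuring the same thing even before padding enters the picture.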

0 Answers