Inconsistency of Igor Ostrovsky's Example 2 in his blog "Gallery of Processor Cache Effects" for Apple M1 Silicon CPU

Question

I replicated the Example 2 mentioned in the Igor Ostrovsky's blog "Gallery of Processor Cache Effects" using a minimal reproducible example (MRE) as to measure the runtime for each K. The K in the blog is same as the variable STEP in the following MRE:

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class DemonstrateCachingAndDataPrefetching {


    public static void main(String[] args) throws IOException {

        //Fetching environment from command line arguments.
        String environment = args[0];

        //Integer array of predefined size (chosen randomly) and acts as the prime data for the experiment.
        int[] arr = new int[64 * 1024 * 1024];

        //Domain for the values of STEP
        int[] STEP_DOMAIN = new int[]{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096};

        //Map to store the cumulative runtime for each value of STEP.
        TreeMap<Integer, Long> map = new TreeMap<Integer, Long>();

        //Number of times the runtime is computed for each value of STEP and then is averaged out.
        int RUNS = 1000;

        for (int STEP : STEP_DOMAIN){

            //Initializing the cumulative time for each value of STEP to 0.
            map.put(STEP, (long)0);

            for(int run = 0; run < RUNS; run++){

                long startTime = System.nanoTime();

                    //Generalized Task
                    for (int i = 0; i < arr.length; i += STEP){
                        arr[i] *= 3;
                    }

                long estimatedTime = System.nanoTime() - startTime;

                long cumulativeTimeForStep = map.get(STEP);
                cumulativeTimeForStep += estimatedTime;
                map.put(STEP, cumulativeTimeForStep);
            }
        }

        FileWriter myWriter = new FileWriter(String.format("results-%1$s.csv", environment));
        myWriter.write("STEP,RUNTIME\n");
        for(Map.Entry m:map.entrySet()){

            Double averageValue = Double.parseDouble(m.getValue()+"");
            averageValue /= (double)RUNS;
            myWriter.write(m.getKey() + "," + String.format("%.02f", averageValue) + "\n");
        }
        myWriter.close();
    }
}

I checked the cacheline size of the Apple M1 CPU using sysctl. It returned 128 Bytes, as shown in the below picture:

QUESTION:

As, the cacheline size is 128 Bytes and size of integer in JAVA is 4 Bytes, the runtime should have been nearly same for the values of the variable STEP = {4, 8, 16, 32} and 64 if data perfetching is supported.

But, the runtime value for STEP = 32 was significantly greater than the runtime for the values of variable STEP = {8, 16, 64}, as shown in the following image:

Had it been an Intel CPU, the runtime for the values of the variable STEP = {8, 16, 32, 64} would have been nearly same (as cacheline size is 128 Bytes). But, this is not the case with the Apple M1 CPU.

Any hint about this inconsistency is appreciated.

not very sure what this is supposed to do, but you are measuring some time for some operations? If so, start using `jmh`. — Eugene, Feb 01 '21 at 19:25
Thanks @Eugene. I will surely consider your suggestion. Can you also let me know other benchmarks like `jmh` which are currently accepted as per industry standards and used for C and C++. — Deepak Tatyaji Ahire, Feb 02 '21 at 12:36
Maybe JIT consume processor time when STEP is 32, try to disable or log JIT(Hotspot/J9) operations. M1 is a ARM architecture, it is different than Intel. Can you repeat test on another ARM processor? — Alag, Feb 05 '21 at 18:58

Inconsistency of Igor Ostrovsky's Example 2 in his blog "Gallery of Processor Cache Effects" for Apple M1 Silicon CPU

0 Answers0