Java performance puzzler: wrapper classes faster than primitive types?

Question

In order to implement some image analysis algorithms without having to worry too much about the data type (i.e. without too much duplicated code), I'm setting up the visitor pattern for primitive arrays in Java.

In the example below, I've defined two types of visitors:

a primitive type, where the signature of the visit method is visit(int, int, double)
a generic type, where the signature of the visit method is visit(int, int, Double).

Apart from this, both visitors perform exactly the same operations. My idea was to try and measure the cost of boxing/unboxing.
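
For reference, here is a minimal sketch of where boxing and unboxing enter the picture with these two signatures. The class and interface names are hypothetical and are not part of the benchmark program; the comments show the calls the Java compiler inserts for autoboxing.

```java
// Hypothetical illustration; these names do not appear in the benchmark program.
final class BoxingSketch {

    interface GenericSink<T> {
        void visit(int x, int y, T value);
    }

    interface PrimitiveSink {
        void visit(int x, int y, double value);
    }

    /** Sums the data through the generic interface: each call boxes and unboxes. */
    static double sumGeneric(final double[] data) {
        final double[] total = {0.0};
        final GenericSink<Double> sink = new GenericSink<Double>() {
            @Override
            public void visit(final int x, final int y, final Double value) {
                total[0] += value; // unboxing: compiled as value.doubleValue()
            }
        };
        for (int i = 0; i < data.length; i++) {
            sink.visit(i, 0, data[i]); // boxing: compiled as Double.valueOf(data[i])
        }
        return total[0];
    }

    /** Sums the data through the primitive interface: no wrapper objects involved. */
    static double sumPrimitive(final double[] data) {
        final double[] total = {0.0};
        final PrimitiveSink sink = new PrimitiveSink() {
            @Override
            public void visit(final int x, final int y, final double value) {
                total[0] += value;
            }
        };
        for (int i = 0; i < data.length; i++) {
            sink.visit(i, 0, data[i]);
        }
        return total[0];
    }

    public static void main(final String[] args) {
        final double[] data = {1.0, 2.5, 4.0};
        System.out.println("generic   = " + sumGeneric(data));
        System.out.println("primitive = " + sumPrimitive(data));
    }
}
```

Both methods compute the same sum; the only difference is the Double.valueOf() / doubleValue() pair on the generic path, which is what the benchmark tries to measure.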

So here is the full program

public class VisitorsBenchmark {
    public interface Array2DGenericVisitor<TYPE, RET> {

        void begin(int width, int height);

        RET end();

        void visit(int x, int y, TYPE value);
    }

    public interface Array2DPrimitiveVisitor<RET> {

        void begin(final int width, final int height);

        RET end();

        void visit(final int x, final int y, final double value);
    }

    public static <RET>
        RET
        accept(final int width,
               final int height,
               final double[] data,
               final Array2DGenericVisitor<Double, RET> visitor) {

        final int size = width * height;
        visitor.begin(width, height);
        for (int i = 0, x = 0, y = 0; i < size; i++) {
            visitor.visit(x, y, data[i]);
            x++;
            if (x == width) {
                x = 0;
                y++;
                if (y == height) {
                    y = 0;
                }
            }
        }
        return visitor.end();
    }

    public static <RET> RET accept(final int width,
                                   final int height,
                                   final double[] data,
                                   final Array2DPrimitiveVisitor<RET> visitor) {

        final int size = width * height;
        visitor.begin(width, height);
        for (int i = 0, x = 0, y = 0; i < size; i++) {
            visitor.visit(x, y, data[i]);
            x++;
            if (x == width) {
                x = 0;
                y++;
                if (y == height) {
                    y = 0;
                }
            }
        }
        return visitor.end();
    }

    private static final Array2DGenericVisitor<Double, double[]> generic;

    private static final Array2DPrimitiveVisitor<double[]> primitive;

    static {
        generic = new Array2DGenericVisitor<Double, double[]>() {
            private double[] sum;

            @Override
            public void begin(final int width, final int height) {

                final int length = (int) Math.ceil(Math.hypot(WIDTH, HEIGHT));
                sum = new double[length];
            }

            @Override
            public void visit(final int x, final int y, final Double value) {

                final int r = (int) Math.round(Math.sqrt(x * x + y * y));
                sum[r] += value;
            }

            @Override
            public double[] end() {

                return sum;
            }
        };

        primitive = new Array2DPrimitiveVisitor<double[]>() {
            private double[] sum;

            @Override
            public void begin(final int width, final int height) {

                final int length = (int) Math.ceil(Math.hypot(WIDTH, HEIGHT));
                sum = new double[length];
            }

            @Override
            public void visit(final int x, final int y, final double value) {

                final int r = (int) Math.round(Math.sqrt(x * x + y * y));
                sum[r] += value;
            }

            @Override
            public double[] end() {

                return sum;
            }
        };
    }

    private static final int WIDTH = 300;

    private static final int HEIGHT = 300;

    private static final int NUM_ITERATIONS_PREHEATING = 10000;

    private static final int NUM_ITERATIONS_BENCHMARKING = 10000;

    public static void main(String[] args) {

        final double[] data = new double[WIDTH * HEIGHT];
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.random();
        }

        /*
         * Pre-heating.
         */
        for (int i = 0; i < NUM_ITERATIONS_PREHEATING; i++) {
            accept(WIDTH, HEIGHT, data, generic);
        }
        for (int i = 0; i < NUM_ITERATIONS_PREHEATING; i++) {
            accept(WIDTH, HEIGHT, data, primitive);
        }

        /*
         * Benchmarking proper.
         */
        double[] sumPrimitive = null;
        double[] sumGeneric = null;

        double aux = System.nanoTime();
        for (int i = 0; i < NUM_ITERATIONS_BENCHMARKING; i++) {
            sumGeneric = accept(WIDTH, HEIGHT, data, generic);
        }
        final double timeGeneric = System.nanoTime() - aux;

        aux = System.nanoTime();
        for (int i = 0; i < NUM_ITERATIONS_BENCHMARKING; i++) {
            sumPrimitive = accept(WIDTH, HEIGHT, data, primitive);
        }
        final double timePrimitive = System.nanoTime() - aux;

        System.out.println("prim = " + timePrimitive);
        System.out.println("generic = " + timeGeneric);
        System.out.println("generic / primitive = "
                           + (timeGeneric / timePrimitive));
    }
}

I know that the JIT is pretty clever, so I was not too surprised when both visitors turned out to perform equally well. What is more surprising is that the generic visitor seems to perform slightly faster than the primitive one. I know benchmarking can be difficult, so I must have done something wrong. Can you spot the error?

Thanks a lot for your help!!! Sébastien

[EDIT] I've updated the code to include a pre-heating phase (to let the JIT compiler do its work). This does not change the results: the generic/primitive ratio is consistently below 1 (0.95 - 0.98).

Passing a Primitive double involves copying 8 bytes on the stack. Passing a Double only takes copying the pointer. — Paul Tomblin, Sep 10 '12 at 12:38
You should put the measured tasks in separate methods and run them a few times until they get compiled (10,000/15,000 should be fine). Then run them in a loop and measure. [This post is a must read](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). — assylias, Sep 10 '12 at 12:38
If I run the test repeatedly, the difference is between 0.99 and 1.06, the generic being slightly slower. — Peter Lawrey, Sep 10 '12 at 12:44
@Peter: strange! I consistently get a result between 0.95 and 0.98! — Sebastien, Sep 10 '12 at 12:51
@Assylias: good suggestion, I've updated the code accordingly, but the result remains the same. Since each loop is already calling only one method, I guess I don't need to create separate methods. — Sebastien, Sep 10 '12 at 12:54
@Paul: interesting! So presumably, doing the same benchmark with byte/Byte should show the opposite trend. Will give it a go. — Sebastien, Sep 10 '12 at 12:56
@PaulTomblin: I've run the same program with Byte/byte instead of Double/double, and the primitive version is now slightly faster (1.0006348844333905). So I'm ready to accept your comment as an answer, but I'm not sure I can :( — Sebastien, Sep 10 '12 at 13:05
post the JVM and the hardware you run the benchmark w/, plus the JVM options. — bestsss, Sep 10 '12 at 14:27

score 2 · Answer 1 · answered Sep 10 '12 at 12:46

I know benchmarking can sometimes be difficult, so I must have done something wrong. Can you spot the error?

I think that the problem is that your benchmarking does not take account of JVM warmup. Take the body of your main method and put it into another method. Then have your main method call that new method repeatedly in a loop. Finally, examine the results, and discard the first few, which are distorted by JIT compilation and other warmup effects.
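
A minimal sketch of that structure, under some assumptions: the class name, the number of rounds (20), the number of discarded rounds (5) and the stand-in workload are all hypothetical choices for illustration, not values taken from the question.

```java
import java.util.Arrays;

// Hypothetical harness; names and round counts are illustrative assumptions.
final class WarmupHarness {

    /** Runs the task `rounds` times and returns the elapsed nanoseconds of each round. */
    static long[] timeRounds(final Runnable task, final int rounds) {
        final long[] elapsed = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            final long start = System.nanoTime();
            task.run();
            elapsed[r] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Mean of the rounds that remain after dropping the first `discard` rounds. */
    static double meanAfterWarmup(final long[] elapsed, final int discard) {
        if (discard < 0 || discard >= elapsed.length) {
            throw new IllegalArgumentException("discard must be in [0, rounds)");
        }
        double sum = 0;
        for (int r = discard; r < elapsed.length; r++) {
            sum += elapsed[r];
        }
        return sum / (elapsed.length - discard);
    }

    public static void main(final String[] args) {
        final double[] data = new double[300 * 300];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        final double[] checksum = new double[1];

        // Stand-in workload; in the question this would be the accept(...) loop.
        final Runnable task = new Runnable() {
            @Override
            public void run() {
                double s = 0;
                for (int i = 0; i < data.length; i++) {
                    s += Math.sqrt(data[i]);
                }
                checksum[0] = s;
            }
        };

        final long[] elapsed = timeRounds(task, 20);
        System.out.println("all rounds (ns): " + Arrays.toString(elapsed));
        System.out.println("mean after discarding 5 rounds (ns): "
                           + meanAfterWarmup(elapsed, 5));
        System.out.println("checksum: " + checksum[0]);
    }
}
```

Printing every round makes the warmup visible: the early rounds are typically the slow ones, and how many to discard is a judgment call made by looking at the numbers.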

bestsss · Answer 2 · 2012-09-11T09:05:36.130

Small tips:

Do not use Math.random() to perform benchmarks, as the results are non-deterministic. You need something like new Random(xxx).
Always print the result of the operation. Mixing benchmark types in a single execution is bad practice, as it can lead to different call-site optimizations (not your case, though).
double aux = System.nanoTime(); -- not all longs fit into doubles exactly.
Post the specification of the environment and the hardware you perform the benchmarks on.
Print 'starting test' with compilation logging enabled (-XX:+PrintCompilation) and garbage collection logging enabled (-verbose:gc -XX:+PrintGCDetails); the GC can kick in during the 'wrong' test just enough to skew the results.
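
A short sketch of the first three tips, with hypothetical names: the seed value 42 and the summing workload are arbitrary assumptions, chosen only to make the example self-contained.

```java
import java.util.Random;

// Hypothetical illustration of the tips above; names and the seed are assumptions.
final class BenchmarkTips {

    /** Fills an array from a seeded generator, so every run sees the same data. */
    static double[] seededData(final int length, final long seed) {
        final Random random = new Random(seed);
        final double[] data = new double[length];
        for (int i = 0; i < length; i++) {
            data[i] = random.nextDouble();
        }
        return data;
    }

    /** Converts a long to double and back, to show where precision is lost. */
    static long roundTrip(final long value) {
        return (long) (double) value;
    }

    public static void main(final String[] args) {
        final double[] data = seededData(300 * 300, 42L);

        System.out.println("starting test");
        final long start = System.nanoTime(); // keep timestamps as long
        double checksum = 0;
        for (int i = 0; i < data.length; i++) {
            checksum += data[i];
        }
        final long elapsed = System.nanoTime() - start;

        // Printing the result keeps the measured work observable.
        System.out.println("elapsed (ns) = " + elapsed);
        System.out.println("checksum = " + checksum);

        // A long above 2^53 does not always survive the trip through double.
        final long big = (1L << 53) + 1;
        System.out.println(big + " -> " + roundTrip(big));
    }
}
```

Keeping the timestamps in long variables and converting to double only when computing the final ratio avoids the precision question altogether.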

Edit:

I checked the generated assembly, and none of the points above is the real reason. No allocation happens for Double.valueOf(): the call is inlined and optimized away entirely, so the value stays in CPU registers. However, without the hardware spec and JVM version there is no real answer.

I found a JVM (1.6.0.26) where the generic version (Double) gets better loop unrolling(!), due to deeper analysis (evidently needed for escape analysis to eliminate the Double.valueOf()) and possibly constant folding of WIDTH/HEIGHT. Change WIDTH/HEIGHT to some prime numbers and the results should differ.

The bottom line is: do not use microbenchmarks unless you know how the JVM optimizes and check the generated machine code.

Disclaimer: I am no JVM engineer.

Thanks for those tips. I'll concentrate on the last one, as I didn't think of this issue. However, I don't think that's the reason for this result, since changing the order of the two loops does not change the result. — Sebastien, Sep 11 '12 at 06:00

score 0 · Answer 3 · answered Sep 10 '12 at 14:50

This is a totally "wild assed guess" but I think it has to do with copying bytes onto the stack. Passing a primitive double involves copying 8 bytes on the stack. Passing a Double only takes copying the pointer.

Paul Tomblin

this can't be true - the method is one call-site, i.e. static - the JVM is to inline it for sure. – bestsss Sep 10 '12 at 15:55
If it's not true, why is it faster with byte than Byte, but slower with double than Double? – Paul Tomblin Sep 10 '12 at 16:16
Checked the generated assembly `(-server -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly)` - both methods are absolutely inlined and the `Double.valueOf()` is elided (i.e. it doesn't exist at all). Byte.valueOf() never allocates, by the way; its values are always cached. – bestsss Sep 10 '12 at 19:23
This is certainly an interesting point. Paul, I was convinced that you had the answer, but bestsss certainly has a point, doesn't he? I'll keep thinking about it... In any case, the really important answer for me is that boxing does not really matter, timewise. It will make my life much easier to retain the generic version of the visitor, as I want to deal with byte[], float[], long[] and so on as well. – Sebastien Sep 11 '12 at 06:02
@Sebastien, boxing does matter if it cannot be inlined, i.e. have more than one (actually 2) classes implementing the interface and you will see a huge difference. Once you start having 3 that are used with roughly the same frequency, you will see a huge impact, as there will be no inlining and call-site optimizations any more. Right now you have a test case that is trivial to optimize, and that's the reason not to use microbenchmarks: [they lie](http://www.azulsystems.com/events/javaone_2002/microbenchmarks.pdf). – bestsss Sep 11 '12 at 09:07
Shame I only saw this comment now... I'm using `Number.doubleValue()` a lot. In this case (when I use the interface), I should see a performance loss, if I understand correctly. Shame, that's the path I have followed... – Sebastien Sep 18 '12 at 14:29
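
To make the scenario in the comments above concrete, here is a hedged sketch of a single call site that sees three receiver classes used with the same frequency. All names are hypothetical. The sketch only sets up the shape of the problem; whether a given JVM still inlines the call depends on its version and flags, and would have to be checked in the generated machine code as described above.

```java
// Hypothetical sketch; class names are illustrative, not from the question.
final class CallSiteSketch {

    interface Visitor {
        void visit(Number value);

        double result();
    }

    static final class SumVisitor implements Visitor {
        private double sum;

        @Override
        public void visit(final Number value) {
            sum += value.doubleValue();
        }

        @Override
        public double result() {
            return sum;
        }
    }

    static final class MaxVisitor implements Visitor {
        private double max = Double.NEGATIVE_INFINITY;

        @Override
        public void visit(final Number value) {
            max = Math.max(max, value.doubleValue());
        }

        @Override
        public double result() {
            return max;
        }
    }

    static final class CountVisitor implements Visitor {
        private int count;

        @Override
        public void visit(final Number value) {
            count++;
        }

        @Override
        public double result() {
            return count;
        }
    }

    /** The visitor.visit(...) call below is the one call site shared by all visitors. */
    static double accept(final double[] data, final Visitor visitor) {
        for (int i = 0; i < data.length; i++) {
            visitor.visit(data[i]); // boxes to Double, passed as Number
        }
        return visitor.result();
    }

    public static void main(final String[] args) {
        final double[] data = {1.0, 4.0, 2.0};
        double checksum = 0;
        // Rotate through three implementations so the call site sees all of them.
        for (int round = 0; round < 30000; round++) {
            final Visitor visitor;
            switch (round % 3) {
                case 0:
                    visitor = new SumVisitor();
                    break;
                case 1:
                    visitor = new MaxVisitor();
                    break;
                default:
                    visitor = new CountVisitor();
                    break;
            }
            checksum += accept(data, visitor);
        }
        System.out.println("checksum = " + checksum);
    }
}
```

Timing this against a variant that only ever passes one visitor class to accept() would be one way to look for the effect, subject to all the benchmarking caveats raised in the answers.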
