
I'm curious about the following construct in Java 8:

double[] doubles = //...
double sum = DoubleStream.of(doubles).parallel().sum();

To cut to the chase:

  • Will the value of sum always be the same, e.g. when run on different computers?

More background...

Floating point arithmetic is lossy and (unlike real-valued arithmetic) is not associative. So unless care is taken in how the work is divided and reassembled, it could lead to non-deterministic results.
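
A quick illustration of the non-associativity (the three literals here are my own, purely for demonstration):

public static void main(String[] args) {
    double a = 0.1, b = 0.2, c = 0.3;
    System.out.println((a + b) + c); // prints 0.6000000000000001
    System.out.println(a + (b + c)); // prints 0.6
}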

I was happy to discover that the sum() method employs Kahan Summation under the hood. This significantly reduces the error, but still does not give precise* results.
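
For reference, Kahan summation carries a running compensation term alongside the sum; a textbook version looks something like the sketch below (for illustration only, this is not the JDK's actual code):

/**
 * Textbook Kahan (compensated) summation. A sketch for illustration only;
 * not the JDK's actual implementation.
 */
private static double kahanSum(double[] values) {
    double sum = 0.0;
    double compensation = 0.0;                     // low-order bits lost in the previous step
    for (double value : values) {
        double corrected = value - compensation;   // re-apply what was lost last time
        double newSum = sum + corrected;           // big + small: low-order bits of 'corrected' are lost...
        compensation = (newSum - sum) - corrected; // ...and recovered here (negated) for the next iteration
        sum = newSum;
    }
    return sum;
}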

In my testing repeated calls appear to return the same result each time, but I'd like to know how stable we can safely assume it is. e.g.:

  1. Stable in all circumstances?
  2. Stable across computers with the same number of cores?
  3. Stable only on a given computer?
  4. Can't depend on it being stable at all?

I'm happy to assume the same JVM version on each computer.

Here's a test I whipped up:

public static void main(String[] args) {
    Random random = new Random(42L);
    for (int j = 1; j < 20; j++) {

        // The stream grows in size, and in the magnitude of its values, at each iteration.
        double[] doubles = generate(random, j*100, j);

        // Like a simple for loop
        double sum1 = DoubleStream.of(doubles).reduce(0, Double::sum); 

        double sum2 = DoubleStream.of(doubles).sum();
        double sum3 = DoubleStream.of(doubles).parallel().sum();

        System.out.println(printStats(doubles, sum1, sum2, sum3));

        // Is the parallel computation stable?
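        // (Run with -ea, otherwise the assert below is silently skipped.)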
        for (int i = 0; i < 1000; i++) {
            double sum4 = DoubleStream.of(doubles).parallel().sum();
            assert sum4 == sum3;
        }
        Arrays.sort(doubles);
    }
}

/**
 * @param spread When odd, returns a mix of +ve and -ve numbers.
 *               When even, returns only +ve numbers.
 *               Higher values cause a wider spread of magnitudes in the returned values.
 *               Must not be negative.  
 */
private static double[] generate(Random random, int count, int spread) {
    return random.doubles(count).map(x -> Math.pow(4*x-2, spread)).toArray();
}

private static String printStats(double[] doubles, double sum1, double sum2, double sum3) {
    DoubleSummaryStatistics stats = DoubleStream.of(doubles).summaryStatistics();

    return String.format("-----%nMin: %g, Max: %g, Average: %g%n"
            + "Serial difference:   %g%n"
            + "Parallel difference: %g",
            stats.getMin(), stats.getMax(), stats.getAverage(), sum2-sum1, sum3-sum1);
}

When I run this, the first few iterations are:

-----
Min: -1.89188, Max: 1.90414, Average: 0.0541140
Serial difference:   -2.66454e-15
Parallel difference: -2.66454e-15
-----
Min: 0.000113827, Max: 3.99513, Average: 1.17402
Serial difference:   1.70530e-13
Parallel difference: 1.42109e-13
-----
Min: -7.95673, Max: 7.87757, Average: 0.0658356
Serial difference:   0.00000
Parallel difference: -7.10543e-15
-----
Min: 2.53794e-09, Max: 15.8122, Average: 2.96504
Serial difference:   -4.54747e-13
Parallel difference: -6.82121e-13

Notice that while sum2 and sum3 can be assumed to be more accurate than sum1, they might not be the same as each other!

I seeded Random with 42, so if anyone gets a different result to me, that would immediately prove something. :-)


* For the curious...

  • Here are some (Python) algorithms that give precise results
  • The precise-sum algorithm with the best-sounding performance characteristics I've heard of is given here (ACM subscription or fee required). It takes 5 flops per input, but is written (in C) to exploit instruction-level parallelism and only run 2-3 times slower than naive summation, which sounds rather good for a precise result. (cf. Kahan summation at 4 flops per input)
Luke Usherwood
  • +1 for a well-written, interesting question with a test-case thrown in! (Very few questions like this on SO nowadays...) – Oliver Charlesworth May 10 '14 at 12:40
  • I would expect that the answer would be "no, don't expect stability at all." – Louis Wasserman May 10 '14 at 12:46
  • I think the documentation of [DoubleStream::sum](http://docs.oracle.com/javase/8/docs/api/java/util/stream/DoubleStream.html#sum--) is pretty clear about this issue: "The value of a floating-point sum is a function both of the input values as well as the **order** of addition operations. The order of addition operations of this method is **intentionally not defined** to allow for implementation flexibility to improve the speed and accuracy of the computed result." – nosid May 10 '14 at 13:34
  • @nosid Yup, I think that's all there is to it & would happily accept that posted as an answer. I should have re-read that paragraph more carefully :-) Purely for my curiosity, I'm still interested if someone could say any more about the current implementation. What makes it stable in my tests? (I actually expected otherwise.) Would it be stable on a multi-CPU architecture? (I'm just multi-core.) – Luke Usherwood May 10 '14 at 14:43
  • Your largest data set is 2k floats; assuming the implementation is reasonably clever it won't spin up any threads for that size. Synchronization etc. (even assuming an existing thread pool) would be more expensive than a single-threaded implementation I think (8 dp flops means we're taking <300 cycles here if optimal) – Voo May 10 '14 at 14:54
  • @Voo it uses 8 pooled threads even in my first iteration of 100. (I've also wondered whether I'd need to externally tune some 'switch-to-parallel threshold', but I'm not looking at that now.) See `ForkJoinPool` for the mechanics, which being in the concurrent package suggests to me that it's *concurrent* (avoiding synchronisation) as far as possible. – Luke Usherwood May 10 '14 at 14:59
  • @Voo +1 excellent caveat to raise. I've got around to performance, so I tried with the `-server` jvm option (which in J8 implies `-XX:+TieredCompilation`). Wow! The serial summing codes are now ~11x faster than `-client` (usual microbenchmark caveats apply...) Parallel no longer keeps up at these data sizes (~2x slower than serial on 4-core i7-860). (NB: for a large GUI application I still find 32-bit Java with `-client` has much better memory characteristics {think, 3x lower}. Maybe I should launch a separate JVM instance for the numerics, if they are implemented in Java.) – Luke Usherwood May 10 '14 at 17:41
  • @Luke I'm more surprised that the original implementation doesn't take the size into account, I assumed that for such a small array it just wouldn't spin up additional threads which would explain the stability. And yes the client compiler has better memory and startup performance, but lacks many useful optimizations. Tiered compilation not only gives usually better performance, but should also help with startup time.. memory not so much sadly. – Voo May 10 '14 at 17:50
  • @Voo I see that this stems from the design of streams really. As a terminal operation `sum()` may be the end of a compound-operation, so it couldn't easily apply such heuristics, even if it was in control of the threading. – Luke Usherwood May 11 '14 at 13:18

2 Answers


I think the documentation of DoubleStream::sum is pretty clear about this issue:

[..] The value of a floating-point sum is a function both of the input values as well as the order of addition operations. The order of addition operations of this method is intentionally not defined to allow for implementation flexibility to improve the speed and accuracy of the computed result. [..]

That means you shouldn't rely on stability, in particular not for parallel streams.


On the other hand, it's not surprising that you see the same results for each run. Conceptually, the sum method might be implemented as follows (in pseudocode, where async forks a sub-task and await waits for its result):

double sum(double[] array, int startInclusive, int endExclusive) {
    int distance = endExclusive - startInclusive;
    if (distance < 1000) {
        double total = 0;
        for (int i = startInclusive; i < endExclusive; ++i) {
            total += array[i];
        }
        return total;
    } else {
        int middle = startInclusive + distance / 2;
        var left = async sum(array, startInclusive, middle);
        var right = async sum(array, middle, endExclusive);
        return await left + await right;
    }
}

Although the scheduling of the asynchronously executed tasks is nondeterministic, the method always returns the same result, because the order of addition operations is the same (i.e. the parentheses are not rearranged).
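
To make that concrete, here is a runnable sketch of the same idea using the fork/join framework (the class name and the split threshold of 1000 are illustrative assumptions, not the JDK's actual code):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Because the split point depends only on the array length, every run adds the
// same values in the same grouping, so the result is repeatable even though the
// scheduling of the sub-tasks is not deterministic.
class SumTask extends RecursiveTask<Double> {
    private final double[] array;
    private final int startInclusive;
    private final int endExclusive;

    SumTask(double[] array, int startInclusive, int endExclusive) {
        this.array = array;
        this.startInclusive = startInclusive;
        this.endExclusive = endExclusive;
    }

    @Override
    protected Double compute() {
        int distance = endExclusive - startInclusive;
        if (distance < 1000) {
            double total = 0;
            for (int i = startInclusive; i < endExclusive; ++i) {
                total += array[i];
            }
            return total;
        }
        int middle = startInclusive + distance / 2;
        SumTask left = new SumTask(array, startInclusive, middle);
        SumTask right = new SumTask(array, middle, endExclusive);
        left.fork();                       // evaluate the left half asynchronously
        double rightSum = right.compute(); // evaluate the right half in this thread
        return left.join() + rightSum;     // combine in a fixed order
    }
}

// Usage: ForkJoinPool.commonPool().invoke(new SumTask(doubles, 0, doubles.length))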

However, a more sophisticated implementation might consider the current work load as well as the expected execution time of sub-tasks (in comparison with the costs of asynchronous operations). If that happens, the results might vary.

nosid

I do get different results from what you posted for the parallel summation, so I can confirm that it is not stable in all circumstances. The serial summation appears to behave the same in your test and in my test. My JVM may be different from yours, and I may have a different number of cores than you have. Anyway, here are the results I obtained for the same iterations that you posted results for.

Oracle Corporation
Java HotSpot(TM) 64-Bit Server VM
25.51-b03
-----
Min: -1.89188, Max: 1.90414, Average: 0.0541140
Serial difference:   -2.66454e-15
Parallel difference: -2.66454e-15
-----
Min: 0.000113827, Max: 3.99513, Average: 1.17402
Serial difference:   1.70530e-13
Parallel difference: 1.70530e-13
-----
Min: -7.95673, Max: 7.87757, Average: 0.0658356
Serial difference:   0.00000
Parallel difference: 3.55271e-15
-----
Min: 2.53794e-09, Max: 15.8122, Average: 2.96504
Serial difference:   -4.54747e-13
Parallel difference: -4.54747e-13
Evan VanderZee