
I know that floating point addition is not associative: (a + b) + c in general does not equal a + (b + c). So this algorithm for sum can give a different result depending on the order of the input:

float naive_sum(float[] input) {
  float accumulator = 0;
  for (float x : input) {
    accumulator += x;
  }
  return accumulator;
}
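
For example, in Java, with one easy-to-check choice of values:

float a = 1e20f, b = -1e20f, c = 1f;
System.out.println((a + b) + c);  // prints 1.0
System.out.println(a + (b + c)); // prints 0.0: the 1f is absorbed by -1e20f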

Is it possible to make this order-independent, so that it returns the same result even when the input is shuffled? I'm not trying to reduce the rounding error: I just want it to be order-independent.

One idea is to sort the input first:

float sort_sum(float[] input) {
  return naive_sum(sort(input));
}

sort doesn't have to put the floats in numeric order; it just has to satisfy sort(input) == sort(shuffle(input)). I think this works, but it's no longer constant space and linear time the way naive_sum was.
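
One concrete version in Java (a sketch: it leans on Arrays.sort(float[]) giving a deterministic total order, with -0.0f before 0.0f and NaNs sorted last):

float sort_sum(float[] input) {
  float[] copy = input.clone();  // leave the caller's array untouched
  java.util.Arrays.sort(copy);   // deterministic total order on float
  return naive_sum(copy);
}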

Another idea is to make the accumulator a huge integer type: big enough to fit any float without rounding. If your floats have an 11-bit exponent (as IEEE doubles do), you would need around 2^11 bits, which comes out to around 2000 bits.

float fixedpoint_sum(float[] input) {
  int2048 accumulator = 0;
  for (float x : input) {
    accumulator += float_to_fixed(x);
  }
  return fixed_to_float(accumulator);
}

Now it's constant space and linear time again, but with such a big accumulator, maybe it's a very slow linear time. :)
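
Here is a sketch of what float_to_fixed and fixed_to_float could look like in Java, with java.math.BigInteger standing in for int2048 (and BigDecimal for the final conversion), so space is bounded in practice rather than strictly constant; infinities and NaN are ignored:

// Scale by 2^149 so the smallest positive subnormal float maps to 1;
// every finite float then converts to an integer exactly.
BigInteger float_to_fixed(float x) {
  int bits = Float.floatToIntBits(x);
  int exp = (bits >>> 23) & 0xFF;
  long mantissa = bits & 0x7FFFFF;
  if (exp != 0) mantissa |= 0x800000;  // implicit leading 1 for normal numbers
  int shift = Math.max(exp, 1) - 1;    // subnormals share the minimum exponent
  BigInteger magnitude = BigInteger.valueOf(mantissa).shiftLeft(shift);
  return bits < 0 ? magnitude.negate() : magnitude;
}

float fixed_to_float(BigInteger accumulator) {
  // 2^149 divides 10^149, so this division is exact; the only rounding
  // happens in floatValue(), which sees the same number for any input order.
  return new BigDecimal(accumulator)
      .divide(new BigDecimal(BigInteger.ONE.shiftLeft(149)))
      .floatValue();
}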

Are there any practical algorithms for order-independent summation of floating-point numbers?

dpercy
  • Is a Java-specific solution acceptable? – Patricia Shanahan Apr 08 '20 at 17:34
    The accumulator method might not be as bad as you think: Get the exponent, use it to index into the accumulator, add, then carry. The carries will occasionally propagate a lot, but rarely unless something in the application design leads to it. However, it is not truly constant time except for the fact it is bounded by accumulator width. That is, how many carries occur can affect the time. But it will often be small. – Eric Postpischil Apr 08 '20 at 18:06
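
For reference, here is one way the scheme from the last comment could look in Java: a deferred-carry variant where each slot is a 64-bit long holding one 32-bit chunk of the scaled sum, so carries only need resolving once at the end. Per-element work is plain integer addition, which is commutative and associative, so the result is order-independent (safe for up to about 2^31 inputs; infinities and NaN ignored; reuses fixed_to_float from the sketch above):

float slot_sum(float[] input) {
  long[] slot = new long[10];  // 10 chunks of 32 bits cover the finite float range
  for (float x : input) {
    int bits = Float.floatToIntBits(x);
    int exp = (bits >>> 23) & 0xFF;
    long mantissa = bits & 0x7FFFFF;
    if (exp != 0) mantissa |= 0x800000;     // implicit leading 1 for normal numbers
    int shift = Math.max(exp, 1) - 1;
    long chunk = mantissa << (shift % 32);  // at most 55 bits wide
    int i = shift / 32;
    long lo = chunk & 0xFFFFFFFFL, hi = chunk >>> 32;
    if (bits < 0) { slot[i] -= lo; slot[i + 1] -= hi; }
    else          { slot[i] += lo; slot[i + 1] += hi; }
  }
  BigInteger total = BigInteger.ZERO;
  for (int i = slot.length - 1; i >= 0; i--) {
    // Horner-style reconstruction resolves all deferred carries and borrows.
    total = total.shiftLeft(32).add(BigInteger.valueOf(slot[i]));
  }
  return fixed_to_float(total);
}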

2 Answers


If your language has a high precision decimal type, such as Java's java.math.BigDecimal, use that to do the summation. Conversion from float or double to BigDecimal is exact. If you do not specify a MathContext that requires rounding, BigDecimal addition is also exact. The final BigDecimal value will be the real-number sum of the inputs, and real-number addition is associative and commutative. The only rounding, and rounding error, will be on the conversion back to float, and that will be converting the same number regardless of input order.

This is similar to your accumulator idea, but taking advantage of an already tested data type and memory management that limits the size of the "accumulator".

private static float sum(float[] data) {
    BigDecimal adder = new BigDecimal(0);
    for (float f : data) {
        // float widens to double exactly, and new BigDecimal(double) is exact,
        // so no information is lost on the way in.
        adder = adder.add(new BigDecimal(f));
    }
    return adder.floatValue();  // the only rounding, of the same value in any order
}
Patricia Shanahan

The "(a+b)+c does not equal a+(b+c)" issue comes from the fact that computers don't work with infinite precision, they are not mathematically exact; but they use some kind of representation which loses digits.

Read What Every Computer Scientist Should Know About Floating-Point Arithmetic for a detailed explanation.

This representation has granularity, which means that the gap between two consecutive representable values is not constant. A small number added to a big one can be absorbed entirely: 1.1E20 + 1E-5 = 1.1E20

Some small improvements:

  • To reduce this big-and-small issue you can sort the numbers by ascending magnitude, so the running sum of the small values can grow large enough to register against the big values, and the addition may be more accurate (see the sketch after this list). Still, there's no guarantee of a good result.

  • Another technique is to sum several times in different orders (1,2,3..., or 3,2,1..., or 1,20,2,19,3,18..., or...) and then calculate the average of all the sums.
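
A sketch of the first idea (the helper name and the ascending-magnitude comparator are just one possible choice):

static float magnitude_sorted_sum(float[] data) {
    Float[] boxed = new Float[data.length];
    for (int i = 0; i < data.length; i++) boxed[i] = data[i];
    // Sort by ascending absolute value so small addends accumulate first.
    java.util.Arrays.sort(boxed, (p, q) -> Float.compare(Math.abs(p), Math.abs(q)));
    float sum = 0;
    for (float x : boxed) sum += x;
    return sum;
}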

The most used (I believe) technique is to enlarge the number of bits: for example, 64 or 128 bits instead of 32, or arbitrary-precision arithmetic. The price is that computations at 128 bits or more of precision are slower.
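
For instance, accumulating float inputs in a double carries 53 instead of 24 mantissa bits per addition (a sketch; the result is still order-dependent, just usually closer to the true sum):

static float double_sum(float[] data) {
    double accumulator = 0;       // 64-bit accumulator for 32-bit inputs
    for (float x : data) {
        accumulator += x;         // each add rounds to double, not float
    }
    return (float) accumulator;   // single narrowing conversion at the end
}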

There exist "Robust Predicates" and the EGC site, which try to reduce errors to a minimum, below the float/double epsilon.

Ripi2