0

a, b are 32 bit floating point values, N is a 32 bit integer and k can take on values 0, 1, 2, ... M. Need to calculate c_k = a + ( N + k ) * b; The operations need to be 32 bit operations (not double precision). The concern is accuracy -- which of the following is more accurate?:

I) c_k = a + ( N + k ) * b

II) first calculate: c_0 = a + N * b
Then calculate c_1, c_2, etc. iteratively by addition:
c_1 = c_0 + b;
c_2 = c_1 + b;

user1823664
  • 1,071
  • 9
  • 16
  • My gut feeling is that option II is better (more performant), but I'm not so sure it is more accurate. I guess it depends on the data. – Rudy Velthuis May 16 '13 at 14:36
  • 4
    Chained addition is one of the worst operations you can do, because the rounding error in the last result will be the sum of the rounding errors on each addition in the chain. It would be more accurate to either use the first way, or to use `c_i = c_0 + b*i`. – Patricia Shanahan May 16 '13 at 14:52
  • @PatriciaShanahan You should submit an answer with your comment. It's critical information for this question. – shoelzer May 17 '13 at 16:19

2 Answers2

3

Chained addition is one of the worst operations you can do, because the rounding error in the last result will be the net sum of the single operation rounding error on each addition in the chain. It would be more accurate to either use the first way, or to use c_i = c_0 + b*i.

Patricia Shanahan
  • 25,849
  • 4
  • 38
  • 75
2

Since you do not seem to care about number of operations, assuming IEEE 754 model you can perform it exactly with 32 bits operations.
See Shewchuck Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates - http://www.cs.berkeley.edu/~jrs/papers/robustr.pdf or http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps

You define two exact operations (see the paper)

(product,residue) = twoproduct(a,b)
(sum,residue) = twosum(a,b)

Then you have to decompose N+k into two 24 bit significands, for example

NkH = (N+k) / 256;
NkL = (N+K) % 256;

Then you have two potentially inexact multiplications

( HH , HL ) = twoproduct( NkH , b)
( LH , LL ) = twoproduct( NkL , b)

Then you can sum these ( HH , HL ) + ( LH , LL ) + a

This can be performed exactly with a fast-expansion-sum (see the paper again)

(c1,c2,c3,c4,c5) = sort_increasing_magnitude(HH,HL,LH,LL,a)
(s2,s1) = twosum( c2,c1 )
(s3,s2) = twosum( c3,s2 )
(s4,s3) = twosum( c4,s3 )
(s5,s4) = twosum( c5,s4 )

You then get the exactly rounded result in s5 as if the operations were performed with infinite precision arithmetic.

aka.nice
  • 9,100
  • 1
  • 28
  • 40