Introduction
This answer presents an initial examination of the error in:
float result = valA * 0.587f + valB * 0.331f;
In this answer, values in the floating-point format and expressions computed with the floating-point format will be represented with code style, as in z
or x * y
. Mathematical variables will use italic and will not be in code style, as in z or x • y.
I assume that all arithmetic is done with IEEE-754 basic 32-bit binary floating-point. This format is commonly used for the float
type, although some programming language implementations mix precisions, possibly using double
or other precision while evaluating expressions of float
type. I also assume all arithmetic is done using the round-to-nearest mode, with ties to the number with the even low bit.
This format as 24 bits in the significand, so the unit of least precision (ULP) is normally 2−23 times the value of the most significant bit. This is the step size between representable values. For example, for values in [1, 2), the ULP is 2−23. For values in [128, 256), the ULP is 27•2−23 = 2−16. (For subnormal values, the significand has fewer bits. The lowest the ULP can be is 2−149. Beyond the largest finite representable value, the step size to the next representable value is infinite. However, in this question, only values of modest value are involved, so we can neglect infinity.)
The result of computing any operation with correct rounding is at most ½ ULP away from the correct answer. That is, if we compute z = x + y
, for example, the computed result z
differs from the exact mathematical result z = x
+ y
by at most ½ ULP of z. (Although z is an exact mathematical result with infinite precision, we use its magnitude to determine which range it falls in in the floating-point format, which determines what we mean by the ULP of z.) The reason the error is at most ½ ULP is that, if the two representable values nearest z are z0
and z1
, we must have z0
≤ z ≤ z1
, and if ½ ULP < z1
− z, then z − z0
< ½ ULP (because z1
− z0
= 1 ULP, by definition of an ULP.) Therefore, in choosing the nearest representable value, we would pick the closer of z0
and z1
, so the error never exceeds ½ ULP.
As stated in a comment, valA
, valB
, and result
are in [0, 256).
Symbolic Analysis
By the time we start computing valA * 0.587f + valB * 0.331f
, valA
and valB
have some errors from previous operations. That is, ideally, using exact mathematics, we would have computed some numbers A and B, but instead the computer calculated valA
and valB
, and the differences are eA = valA
− A and eB = valB
- B.
Ideally, we would like to compute the number R such that R is, using exact mathematics, A • .587 + B • .331. When we use computer arithmetic:
0.587f
will be converted from .587 to the floating-point format, and the result will have some rounding error e0, so the result is 0.587f
= 0.587 + e0.
0.331f
will be converted from .331 to the floating-point format, and the result will have some rounding error e1, so the result is 0.331f
= .331 + e1.
valA * 0.587f
will be computed with some error e2, so the result will be valA * 0.587f
= valA
• 0.587f
+ e2.
valB * 0.331f
will be computed with some error e3, so the result will be valB * 0.331f
= valB
• 0.331f
+ e3.
- The two products will be added, with some error e4, so the result will be
valA * 0.587f + valB * 0.331f
= valA * 0.587f
+ valB * 0.331f
+ e4.
Now we can substitute the expressions, so:
valA * 0.587f + valB * 0.331f
= (valA
• 0.587f
+ e2) + (valB
• 0.331f
+ e3 + e4.
valA * 0.587f + valB * 0.331f
= (valA
• (0.587 + e0) + e2) + (valB
• (.331 + e1) + e3) + e4.
valA * 0.587f + valB * 0.331f
= ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.
With this, we have expressed the computed result, valA * 0.587f + valB * 0.331f
, as an exact mathematical expression (of variables with incompletely known values), ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.
Numerical Analysis
Next, we can place some bounds on the errors. e0 and e1 are easy, their magnitudes are at most ½ ULP of .587 and .331, respectively. .587 is in [½, 1), so its ULP is 2−24, and .331 is in [¼, ½), so its ULP is 2−25. So |e0| ≤= 2−25, and |e1| ≤= 2−26.
Bounds on e2 and e3 depend on the magnitudes of valA * 0.587f
and valB * 0.331f
. Since val
< 256, valA * 0.587f
< 256, so its ULP is at most 2−16, and |e2| ≤ 2−17. With valB
, we can see that valB * 0.331f
< 128, so the ULP of valB * 0.331f
is at most 2−17, and |e3| ≤ 2−18.
Finally, we have the error e4 that occurs in the final addition of valA * 0.587f + valB * 0.331f
. We have assumed this is less than 256, so its ULP is at most 2−16, and |e4| ≤ 2−17.
Looking at the mathematical expression of the computed result, ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4, we can see that the largest possible error occurs when e0, e1, e2, e3, and e4 have the greatest values (unless eA or eB is huge and negative, which we assume not to be true). So we can substitute the upper bounds we have prepared for these errors:
((A + eA) • (0.587 + 2−25) + 2−17) + ((B + eB) • (.331 + 2−26) + 2−18) + 2−17.
In the interests of time, I evaluated this with Maple. (It might be a bit more illuminating to expand the expression manually and retain some of the factors rather than consolidating coefficients into single numbers, but I leave that to the reader.) The result is:
2462056573/4194304000 • A + 2462056573/4194304000 • eA + 5/262144 + 2776629373/8388608000 • B + 2776629373/8388608000 • eB.
The ideal result is A • .587 + B • .331. When we subtract that from the above, the result is a bound on the error in the computation:
1/33554432 • A + 2462056573/4194304000 • eA + 5/262144 + 1/67108864 * B + 2776629373/8388608000 • eB.
Since A < 256 and B < 256, we can substitute 256 for A and for B, yielding:
1/32768 + 2462056573/4194304000 • eA + 2776629373/8388608000 • eB.
Reversing a bit of Maple’s arithmetic, that is:
2−15 + (.587 + 2−25) • eA + (.331 + 2−26) • eB.
So, that is an upper bound on the error in valA * 0.587f + valB * 0.331f
. It could possibly be reduced more with additional information about the relationship between valA
and valB
. Also, the errors in converting .587 and .331 to float
are exactly known, so those should be used instead of the bounds I used as illustration in this answer.
One also needs to establish a lower bound on the error. The rounding errors could be negative, and we have to ask what the lowest possible value of ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4 is. As I am out of time for now, that is left for the reader.
Addendum
e0 is 13/1048576000. e1 is 1/4194304000. Then the upper bound on the error can be reduced to 731/32768000 + 4924113/8388608 • eA + 11106517/33554432 • eB, which is:
.731•2−15 + (.587 + .013•2−20) • eA + (.331 + .001•2−22) • eB.