C thinking : float vs. integers and float representation

Question

When using integers in C (and in many other languages), one must pay attention to precision when dividing. It is generally better to multiply and add first (thus creating a larger intermediate result, so long as it doesn't overflow) and to divide last.

But what about floats? Does that still hold? Or are they represented in such a way that it is better to divide numbers of similar orders of magnitude rather than large ones by small ones?

First you need to understand how floats are represented in memory, see [this answer](https://softwareengineering.stackexchange.com/a/215126/235262) (from Software Engineering) — Martin Verjans, Jun 14 '17 at 13:49
What do you mean that when using integers one must pay attention to precision? If you divide an integer you are using integer division, with all its consequences: the precision is at the 0th decimal digit. In this sense integer division is precise. — Davide Spataro, Jun 14 '17 at 13:56
Keeping rounding errors to a minimum when doing floating-point operations is a rather huge topic. As a simple rule, it is best if the operands have similar magnitude. For instance, adding a small float value to a huge float value will not change the huge value at all (provided that the difference between them is sufficiently large). — Support Ukraine, Jun 14 '17 at 14:00
Yes but most of the time when you implement you use integer division as a degraded version of the ideal division you would like to perform, and simply live with the fact that no decimal parts come out. Yet if you change units your results change ... example, you work in meters and divide 10 meters by 4 and get 2, but if you switch to millimeters, 10 000 divided by 4 becomes 2500, a much better result ... — Charles, Jun 14 '17 at 14:00
@4386427 Yes, adding (or subtracting) numbers of dissimilar magnitude can result in loss of precision (this is a real problem when adding many numbers of similar magnitude, e.g. to find the average), but multiplying (or dividing) numbers of dissimilar magnitude isn't so bad. — Ian Abbott, Jun 14 '17 at 14:06
Of course the classic example is that `0.1 + 0.2 == 0.3` evaluates to `False` — Baldrickk, Jun 14 '17 at 14:09
@MartinVerjans: I'd hesitate to link to that answer - it's really not that good an explanation, and I got some things wrong. Better to link to an authoritative source (i.e., not me). — John Bode, Jun 14 '17 at 15:03
@DavideSpataro: an extreme example of "paying attention to precision" would be to multiply three 16-bit numbers (with full possible range) together (in e.g. a 32-bit integer word environment), and then to divide by a 24-bit one. You need to sequence or re-factor the operations so as not to lose significant digits by under- or overflow, even though "end-to-end" the result will fit inside a 32-bit word. — MikeW, Jun 14 '17 at 15:49

MikeW · Answer 1 · 2017-07-04T11:19:42.767

The representation of floats/doubles, and floating-point working in general, is geared towards retaining a number of significant digits (aka "precision") rather than a fixed number of decimal places, as happens in fixed-point or integer working.

It is best to avoid combining quantities in ways that may give rise to implicit underflow or overflow of the exponent, i.e. at the limits of the floating-point number range.

Hence, addition/subtraction of quantities of widely differing magnitudes (either explicitly, or due to having opposite signs) should be avoided and re-arranged, where possible, to avoid this well-known route to lost precision.

Example: it's better to refactor/re-order

small + big + small + big + small

as

(small+small+small) + big + big

since the smalls individually might make no difference to a big, and hence their contribution might disappear.
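A minimal sketch of the re-ordering, assuming IEEE 754 single-precision `float` arithmetic that is actually evaluated in `float` precision. The function names are made up for illustration, and only one big value is used to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Starts from the big value and adds each small value to it in turn.
   Each small value may be rounded away individually. */
float sum_big_first(float big, const float *small, size_t n)
{
    assert(small != NULL || n == 0);
    float total = big;
    for (size_t i = 0; i < n; i++)
        total += small[i];
    return total;
}

/* Adds the small values together first, then adds the big value once,
   so the combined small contribution has a chance to survive. */
float sum_small_first(float big, const float *small, size_t n)
{
    assert(small != NULL || n == 0);
    float smalls = 0.0f;
    for (size_t i = 0; i < n; i++)
        smalls += small[i];
    return smalls + big;
}
```

With `big = 16777216.0f` (2^24) and four small values of `1.0f`, the first function returns 16777216 unchanged, because 16777217 is not representable and each addition rounds back down. The second returns 16777220.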

If there is any "noise" or imprecision in the lower bits of any quantity, it's also wise to be aware how loss of significant bits propagates through a computation.

chux - Reinstate Monica · Answer 2 · 2017-06-14T15:52:57.940

With integers:
As long as there is no overflow, +, -, * are always exact.
With division, the result is truncated and often not equal to the mathematical answer.
Given ia, ib, ic, multiplying before dividing (ia*ib/ic rather than ia*(ib/ic)) is better, as the quotient is then based on all the bits of the product ia*ib rather than only on those of ib.
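
The integer case can be sketched as follows. The function names are hypothetical, and the 64-bit intermediate is an assumption added here to sidestep the overflow caveat; the final result is assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Divide first: ib/ic is truncated before it is scaled by ia. */
int32_t scale_divide_first(int32_t ia, int32_t ib, int32_t ic)
{
    assert(ic != 0);
    return ia * (ib / ic);
}

/* Multiply first: the product is formed in 64 bits so it cannot overflow,
   and truncation happens only once, at the very end. */
int32_t scale_multiply_first(int32_t ia, int32_t ib, int32_t ic)
{
    assert(ic != 0);
    int64_t product = (int64_t)ia * ib;
    return (int32_t)(product / ic);   /* assumes the quotient fits in int32_t */
}
```

For ia = 10, ib = 7, ic = 4 the mathematical answer is 17.5. Dividing first gives 10 (since 7/4 truncates to 1), while multiplying first gives 17.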

With floating point:
Issues are subtle. Again, as long as there is no over/underflow, the order of a *, / sequence has less impact than with integers. FP *, / is akin to adding/subtracting logs. Each individual operation is typically within 0.5 ULP of the mathematically correct answer.
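
To see how little the order matters here, the following sketch (a hypothetical helper, assuming IEEE 754 `double` and no overflow or underflow) measures the relative gap between the two orderings.

```c
#include <assert.h>
#include <math.h>

/* Relative difference between (a*b)/c and a*(b/c) in double precision.
   Each ordering performs two correctly rounded operations, so the two
   results can only differ by a few units in the last place. */
double order_gap(double a, double b, double c)
{
    assert(c != 0.0);
    double multiply_first = (a * b) / c;
    double divide_first = a * (b / c);
    assert(multiply_first != 0.0);
    return fabs(multiply_first - divide_first) / fabs(multiply_first);
}
```

For inputs such as `order_gap(10.0, 7.0, 3.0)` the gap stays within a few multiples of `DBL_EPSILON`, in contrast to the integer case, where the two orderings can differ in their leading digits.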

With FP and +, -, the result for fa, fb, fc can differ significantly from the mathematically correct one when 1) the values are far apart in magnitude or 2) nearly equal values are subtracted, so that the error in a prior calculation becomes significant.

Consider the quadratic equation:

double d = sqrt(b*b - 4*a*c);  // assume b*b - 4*a*c >= 0
double root1 = (-b + d)/(2*a);
double root2 = (-b - d)/(2*a);

Versus

double d = sqrt(b*b - 4*a*c);  // assume b*b - 4*a*c >= 0
// pick the sign that adds magnitudes, so -b and d never cancel
double root1 = (b < 0) ? (-b + d)/(2*a) : (-b - d)/(2*a);
double root2 = c/(a*root1);  // assume a*root1 != 0

The second version gives a much more precise result for the root near 0 when |b| is nearly d. In the first version, the subtraction of b and d cancels many bits of significance, allowing the error in the calculation of d to become significant.

score 0 · Answer 3 · edited Jun 20 '20 at 09:12

(for integer) It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.

Does that still hold (for floats)?

In general the answer is No

It is easy to construct an example where adding all input before division will give you a huge rounding error.

Assume you want to add 10000000000 values and divide them by 1000. Further assume that each value is 1. So the expected result is 10000000.

Method 1: If you add all the values before division, you'll get the result 16777.216 (for a 32-bit float), which is far off. The running sum stops growing at 16777216 (2^24), because adding 1 to it no longer changes it.

Method 2: So is it better to divide each value by 1000 before adding it to the result? If you do that, you'll get the result 32768.0 (for a 32-bit float), which is far off as well.

Method 3: However, if you keep adding values until the temporary result is greater than 1000000, then divide the temporary result by 1000, add that intermediate result to the final result, and repeat until you have added a total of 10000000000 values, you will get the correct result.

So there is no simple "always add before division" or "always divide before adding" when dealing with floating point. As a general rule it is typically a good idea to keep operands in similar magnitude. That is what the third example does.
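
The three methods can be sketched in C as below. This is an illustration under stated assumptions, not the answer's original code: the function names are invented, a count-based chunk size stands in for the "temporary result greater than 1000000" test, and IEEE 754 single precision evaluated in `float` is assumed.

```c
#include <assert.h>

/* Method 1: add every value, divide once at the end. */
float sum_then_divide(long count, float value, float divisor)
{
    assert(divisor != 0.0f);
    float sum = 0.0f;
    for (long i = 0; i < count; i++)
        sum += value;
    return sum / divisor;
}

/* Method 2: divide every value first, then add the quotients. */
float divide_then_sum(long count, float value, float divisor)
{
    assert(divisor != 0.0f);
    float sum = 0.0f;
    for (long i = 0; i < count; i++)
        sum += value / divisor;
    return sum;
}

/* Method 3: accumulate a chunk, divide the chunk, add it to the result. */
float chunked_sum_divide(long count, float value, float divisor, long chunk)
{
    assert(divisor != 0.0f && chunk > 0);
    float result = 0.0f;
    float partial = 0.0f;
    long in_chunk = 0;
    for (long i = 0; i < count; i++) {
        partial += value;
        if (++in_chunk == chunk) {
            result += partial / divisor;
            partial = 0.0f;
            in_chunk = 0;
        }
    }
    return result + partial / divisor;  /* flush any incomplete chunk */
}
```

Running all 10000000000 iterations takes a while, so a scaled-down run with 50000000 values of 1.0f, a divisor of 1000 and a chunk of 1000000 is enough to show the same effect. The expected result is then 50000: Method 1 stalls at about 16777.216, Method 2 stalls at 32768.0, and Method 3 returns 50000.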
