Rounding error reduction?

Question

Consider the following functions:

#include <iostream>
#include <iomanip>
#include <cmath>
#include <limits>

template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
    return (std::sqrt(std::abs(2*b1-z0))*dx)+a0;
}

template <typename Type>
inline Type b(const Type dx, const Type a0, const Type z0, const Type a1)
{
    return (std::pow((a1-a0)/dx, 2)+ z0)/2;
}

int main(int argc, char* argv[])
{
    double dx = 1.E-6;
    double a0 = 1;
    double a1 = 2;
    double z0 = -1.E7;
    double b1 = -10;
    std::cout<<std::scientific;
    std::cout<<std::setprecision(std::numeric_limits<double>::digits10);
    std::cout<<a1-a(dx, a0, z0, b(dx, a0, z0, a1))<<std::endl;
    std::cout<<b1-b(dx, a0, z0, a(dx, a0, z0, b1))<<std::endl;
    return 0;
}

On my machine, it returns:

0.000000000000000e+00
-1.806765794754028e-07

Instead of (0, 0). There is a large rounding error with the second expression.

My question is: how to reduce the rounding error of each function without changing the type (I need to keep these 2 functions declarations (but the formulas can be rearanged): they come from a larger program)?

How are `dx` and `z0` determined? It looks like `dx` might be some semi-arbitrary step size that you can choose, and that `z0` is `b1/dx`. If so, setting `dx` to a (negative) power of two and to a larger value, and adjusting `z0` correspondingly, will reduce error some. For example, with `dx` set to 1/1024 and `z0` set to 10240, the error, when evaluated entirely with `double`, is reduced to 9e-13. You should not expect to get it to zero, though. — Eric Postpischil, Nov 14 '13 at 03:27
Additionally, if `b` can be modified to omit the addition of `a0` (and its callers adjusted accordingly; e.g., `a` would be modified not to subtract it, and other routines would add it themselves when desired), then enough accuracy is retained that the final error is zero in some cases (e.g., when `dx` is 1/1048576, which is 2**-20, close to the original value of 10**-6). — Eric Postpischil, Nov 14 '13 at 03:32

score 1 · Accepted Answer · answered Nov 14 '13 at 01:05

Sadly, all of the floating point types are notorious for rounding error. They can't even store 0.1 without it (you can prove this using long division by hand: the binary equivalent is 0b0.0001100110011001100...). You might try some workarounds like expanding that pow to a hard-coded multiplication, but you'll ultimately need to code your program to anticipate and minimize the effects of rounding error. Here are a couple ideas:

Never compare floating point values for equality. Some alternative comparisons I have seen include: abs(a-b) < delta, or percent_difference (a,b) < delta or even abs(a/b-1) < delta, where delta is a "suitably small" value you have determined works for this specific test.
Avoid adding long arrays of numbers into an accumulator; the end of the array may be completely lost to rounding error as the accumulator grows large. In "Cuda by Example" by Jason Sanders and Edward Kandrot, the authors recommend recursively adding each pair of elements individually so that each step produces an array half the size of the previous step, until you get a one-element array.

The OP already understands that floating-point arithmetic is approximate, and their question does not contain additions of long arrays. So this answer is unhelpful. Even though floating-point arithmetic, there are techniques to avoid or reduce errors in various situations, and that is what the question is asking for. — Eric Postpischil, Nov 14 '13 at 03:14

snips-n-snails · Answer 2 · 2013-11-14T01:30:12.497

In a(), you lose precision when you add a0 (which is exactly 1) to the small and imprecise result of sqrt()*dx.

The function b() doesn't lose any precision using the supplied values.

When you call a() before b() as in the second output, you're doing mathematical operations on a number that's already imprecise, compounding the error.

Try to structure the mathematical operations so you do operations that are less likely to create floating point errors first and those more likely to create floating point errors last.

Or, inside your functions, make sure they are operating on "long double" values. For example, the following uses floating-point promotion to promote double to long double during the first mathematical operation (pay attention to operator precedence):

template <typename Type>
inline Type a(const Type dx, const Type a0, const Type z0, const Type b1)
{
    return (std::sqrt(std::abs(2*static_cast<long double>(b1)-z0))*dx)+a0;
}

Keep in mind that some compilers have a long double that is the same number of bits as a double, such as the Microsoft C++ compiler. — David Rector, Sep 09 '18 at 20:31

Rounding error reduction?

2 Answers2