Approximation error when using sqrt and floor

Question

I have to do an enumeration of solutions of an equation and I know that y < x *( sqrt(n) - 1 ), where x, y and n are integers.

My naive approach would be to look for y less or equal than floor( x * ( sqrt( (float)n ) - 1 ) ).

Should I be worried about approximation error?
For example, if my expression is a little be greater than an integer m, should I be worried to get m-1 at the end?
How could I detect such errors?

One improvement is to take the multiplication outside the `floor` call. — Ben Voigt, Mar 11 '15 at 23:11
Do you need a numeric upper bound on `y`, or just a stopping condition for a loop? For example, if x is positive, that condition is equivalent to: `((x+y) < 0) || ((x+y)*(x+y) < n*x*x)` and when x is negative, it simplifies to `(x+y)*(x+y) > n*x*x` — Ben Voigt, Mar 11 '15 at 23:21
I think you need the multiplication inside the `floor` rather than outside: e.g. with x=10, n=15: `floor(x*sqrt((float)n)-x)` gives 28 while `x*floor(sqrt((float)n)-1)` gives 20. — halfflat, Mar 11 '15 at 23:41

halfflat · Accepted Answer · 2015-03-11T23:33:13.573

You should definitely be worried about approximation error, but how worried depends upon the ranges of values for x and n you are concerned about.

Calculations in IEEE 4-byte floating point representations are going to have errors roughly on the order of one part in 2^23 to 2^24; for an 8-byte representation (i.e, a double), it will be roughly one part in 2^52 to 2^53. You could expect then that you would need to use doubles rather than floats to get an accurate result for 32-bit integers x and n, and that even a double would be insufficient for 64-bit integers.

As an example, consider the code:

template <typename F,typename V>
F approxub(V x,V n) {
    return std::floor(x*std::sqrt(F(n))-x);
}

uint64_t n=1000000002000000000ull; // (10^9 + 1)^2 - 1
uint64_t x=3;
uint64_t y=approxub<double>(x,n);

This gives a value of y=3000000000, but the correct value is 2999999999.

It's even worse when x is large and n is small: large 64-bit integers are not exactly representable in IEEE doubles:

uint64_t n=9;
uint64_t x=5000000000000001111; // 5e18 + 1111
uint64_t y=approxlb<double>(x,n);

The correct value for y (putting to one side the issue of when n is a perfect square — the true upper bound will be one less in this case) is 2 x = 10000000000000002222, i.e. 1e19 + 2222. The computed y is, however, 10000000000000004096.

Avoiding floating point approximations

Suppose you had a function isqrt which exactly computed the integer part of the square-root of an integer. Then you could say

y = isqrt(x*x*n) - x

and provided that the product x*x*n fit inside your integer type, you would have an exact upper bound (or one more than the upper bound if n is a perfect square.) There's more than one way to write an isqrt function; this is an example implementation based on the material at code codex:

template <typename V>
V isqrt(V v) {
    if (v<0) return 0;

    typedef typename std::make_unsigned<V>::type U;
    U u=v,r=0;

    constexpr int ubits=std::numeric_limits<U>::digits;
    U place=U(1)<<(2*((ubits-1)/2));

    while (place>u) place/=4;
    while (place) {
        if (u>=r+place) {
            u-=r+place;
            r+=2*place;
        }
        r/=2;
        place/=4;
    }
    return (V)r;
}

What if x is too large for this? For example, if our largest integral type has 64 bits, and x is larger than 2^32. The most straightforward solution would be to do a binary search, taking as bounds x r - x and x r, where r = [√n] is the integer square root.

I've a feeling that "enumeration of solutions" rules out such large ranges. — Ben Voigt, Mar 11 '15 at 23:23
Thanks a lot! I will use the isqrt function but the information about the representations of decimal numbers is really imteresting. — rgugliel, Mar 13 '15 at 14:00

Approximation error when using sqrt and floor

1 Answers1

Avoiding floating point approximations

Linked