Find smallest integer that satisfies floating point inequality equation

Question

I am looking for a fast algorithm that finds the smallest integer N that will satisfy the following inequality where s, q, u, and p are float numbers (using the IEEE-754 binary32 format):

s > q + u * p / (N - 1)

where N can be any positive integer represented by a signed 32-bit integer. After (N - 1) is converted to float, all arithmetic is evaluated in float.

Additional constraints are:

0 < p < 1.
-1 ≤ q ≤ 1.
q < s.
0 < u.

I am having trouble figuring out how to do this in a robust way that deals with floating point rounding errors and comparison properly. Here is my poor attempt at a solution that is not fast and is not even robust in that I cannot determine the minimum SOME_AMOUNT:

int n = std::max(1.0f, floorf((u * p / (s - q)) - 1.0f));

// Floating point math might require to round up by some amount...
for (int i = 0; i < SOME_AMOUNT; ++i)
    if (!(q + (u * p / (n + 1)) < second))
        ++n;

You can see above my formula for calculating n using basic algebra. The for loop is my crude means of trying to account for floating point rounding errors. I am checking it with brute force like this:

int nExact = 0;
bool found = false;
for (; nExact < SOME_BIG_NUMBER; ++nExact) {
    if (q + (u * p / (nExact + 1)) < second) {
        found = true;
        break;
    }
}
assert(found);
assert(n == nExact);

Any floating point gurus have a reasonably fast an answer in C++?

Frankly, if someone can even just give a theoretically sound proof of an upper bound on 'SOME_AMOUNT' above I'd be reasonably happy...

Before unleashing fingers to write code, do some basic algebraic manipulations on paper to turn `s > q + u * p / (N - 1)` into an inequality with `N` on one side and everything else on the other. You'll have to allow for a few cases (e.g. if the algebraic manipulation involves dividing by something, take care of the case where that something is zero) but you'll wind up with some simple closed form formulae to calculate `N` given the values of `p`, `q`, `u`, and `s`. At most, a few `if()` and `else`, and definitely no need for a loop. — Peter, May 17 '20 at 08:18
Do you want a solution in which `s > q + u * p / (N - 1)` is true when evaluated with floating-point arithmetic or a solution in which s > q + u * p / (N - 1) is true when evaluated with real-number arithmetic? Is the domain of N the set of integers representable in the floating-point format or the set of integers? Do p and q have the same sign? Is s > q? What do you know about s, q, u, and p? Do you know any bounds on their values? Any constraints on their domains? Where do they come from? — Eric Postpischil, May 17 '20 at 09:25
@Peter: Those may not be valid transformations, depending on the answers to the questions in my above comment. — Eric Postpischil, May 17 '20 at 09:27
Peter, if you look at my code you'll see that I've already done that. The for loop is just to handle the rounding errors. Eric, I'm looking for a floating point solution. The domain is the set of positive integers by 32 bit integer. 'p' is guaranteed to be in the range greater than 0.0f but less than or equal to 1.0f. 'q' is guaranteed to be in the range -1.f to 1.f. — Yeshe Tenley, May 17 '20 at 11:05
s - q is guaranteed not to be 0. And yes, s is greater than q. — Yeshe Tenley, May 17 '20 at 11:13
Just to cut out part of the problem, given `s > q`, if `u` and `q` have different signs, then the solution is 2, assuming 1 is ruled out due to division by zero, since `u * q / (2-1)` is then negative or zero, and `s > q + u * q / (2-1)` is true. So we can reduce the problem to `u` and `p` having the same sign. And `u * q` can be replaced by `x`, as they do not otherwise participate in the expression. So we have `s > q + x / (N-1)`, where x is positive. — Eric Postpischil, May 17 '20 at 11:49
The basic arithmetic operations are weakly monotonic in floating-point where the corresponding real-number operations are monotonic or weakly monotonic. That may be useful in establishing a bound for checking candidates for `N`. (Obviously, N could be found easily in real arithmetic, but given we are requested to find a solution in floating-point arithmetic, rounding issues may cause a floating solution for `N` to differ from a real solution for N. Establishing bounds can give us an efficient empirical solution.) — Eric Postpischil, May 17 '20 at 11:52
Is the floating-point type definitely `float`, not `double`, and is it using IEEE-754 binary32 (the most common format used for `float`)? — Eric Postpischil, May 17 '20 at 12:54
It is definitely a float and not a double and yes, it is using standard floating point format for C++. — Yeshe Tenley, May 17 '20 at 13:02
One thing to consider is that, since N is a 32-bit integer, and the expression is evaluated using `float`, N must be converted to `float`, which introduces a rounding error. Consider the case where q is at least ½s. Then `s-q` computed in `float` is exact (has no rounding error), and the smallest `float n` that satisfies `s > q + x/n` is `(s-q)/x` or 1 ULP higher or lower, depending on rounding in the division. We may find, for example that `n` is 2147483392. In that case, `N` would be 2147483266, because then `N-1` is 2147483265, and that is the smallest integer that rounds up to 2147483392. — Eric Postpischil, May 17 '20 at 16:23
So, to be clear, is that really the solution you want? We would produce 21474832656 for N because, when `N-1` is converted to floating-point, that produces 2147483392, and then dividing by that causes `s > q + x/(N-1)` to be true, even though s > q + x/(N-1) would not be true with real-number arithmetic because the rounding up of N-1 to that greater number would not occur? — Eric Postpischil, May 17 '20 at 16:25
You don't need to worry about overflow for the 32 bit integer. Assume that an integer less than std::numeric_limits::max() will satisfy the equation. IOW, the corner case where the largest int does not satisfy the equation is the least of what I'm worried about. And yes, the rounding errors are precisely what I'm trying to overcome. But I don't know how to do it and still give an exact and robust answer for the smallest integer that will satisfy the equation. That's why I asked the question :) — Yeshe Tenley, May 17 '20 at 17:17
I'm assuming `s`, `q`, `u`, and `p` are fixed values for every instance of this problem. If that's the case, you can use an SMT solver to minimize the value of `N` each time you have a new set of values for `s`, `q`, `u`, and `p`. If you post a specific instance, I can run an SMT solver problem to find the corresponding minimum `N`. Then, you can write a program to do this on your own, if you're willing to program a bit using the publically available APIs. (In particular, I'm thinking of z3: https://github.com/Z3Prover/z3) — alias, May 17 '20 at 18:02
s, q, u, and p, are all variables that can range over the values specified in the OP. It is no problem to find an exact and robust answer to this question by a simple for loop going from integer 0 to some large number, however that is not a fast or efficient solution. I don't see how an SMT solver is going to give a fast or efficient solution as the problem here has to do with rounding errors and float comparison. — Yeshe Tenley, May 17 '20 at 20:13
An SMT solver can directly optimize and give you the best solution, They model IEEE-754 semantics with all rounding modes, so you can find specific values of `N` for each rounding mode you want. They do this without iteration, but rather modeling the underlying circuit and performing maxsat optimization. (Whether it'll beat your loop is something to see after experimenting, of course.) See here for details: http://smtlib.cs.uiowa.edu/theories-FloatingPoint.shtml — alias, May 17 '20 at 20:37
Do you have any further bounds on `s` and `q`? If 1 < `s`, then I think `nexttowardf(v, -INF) / (nexttowardf(s, INF) - q) + 1` could be a lower bound on `N`, and it would be reasonably tight in that the solution is within a few increments of it. In the absence of 1 < `s`, the denominator gets more complicated. — Eric Postpischil, May 19 '20 at 01:31
Are you content to handle only one rounding mode and strict evaluation (as is common outside the x87)? Maybe the latter is what you meant by “evaluated in `float`”. — Davis Herring, May 19 '20 at 02:22
Davis, eric edited the OP to include the line 'evaluated in float' so you'll have to ask him. For my part, I can say that I have been using only float type and 32 bit int. — Yeshe Tenley, May 19 '20 at 13:42
Eric, no, I'm sorry, but s does not have any more constraints. It s can be negative even as can q. q does have constraints as it is guaranteed to be -1 <= q <=1 as you edited the OP to show. — Yeshe Tenley, May 19 '20 at 13:45
BTW, I think `const int n = floorf((u * p / (s - q)) - 1.0f);` is a lower bound, yes? — Yeshe Tenley, May 19 '20 at 13:47
Is there a guarantee that `s` and `q` are such that their difference (`s - q`) is always a normalized floating-point number `s - q > FLT_MIN`? — EOF, May 19 '20 at 18:21
I can tell you that s and q are not equal and that s and q are both normal floating point numbers. — Yeshe Tenley, May 19 '20 at 18:41
Yeshe Tenley Is `float s` meant to be a whole number like 0.0f, 1.0f, 2.0f, .... or any `float` more than -1.0f or something else? — chux - Reinstate Monica, May 21 '20 at 10:58

Ankur Parihar · Answer 1 · 2020-05-22T12:31:20.897

To be on safe side, we can first get a larger possible value (upper bound) and a smaller possible value (lower bound) and than reduce it to our actual answer, this way it will be accurate and faster that just iterating over numbers.

By solving the inequality we get,

N > u * p / (s - q) + 1

Getting an upper bound

So you will first find a maximum guessed answer, by using integers. We will increase numerator and integer cast denominator

int UP = (int)(u * p + 1);    // Increase by one
int D = (int)(s - q);         // we don't increase this because it  would cause g to decrease, which we don't want

float g = UP / (float)D + 1;  // we again float cast D to avoid integer division
int R = (int)(g + 1);         // Now again increase g

/******** Or a more straight forward approach ********/
int R = (int)(((int)(u*p+1))/(s-q) + 1 + 1)

// Add rounding-off error here
if(R + 128 < 0) R = 2147483647;    // The case of overflow
else R += 128;

This is your maximum answer (upper bound).

Getting a lower bound

Just as previous but this time we will increase denominator and integer cast numerator

int UP = (int)(u * p);         // will automatically decrease
int D = (int)(s - q + 1);      // we increase this because it would cause g to decrease, which we want

float g = UP / (float)D + 1;   // we again float cast D to avoid integer division
int L = (int)g;                // Integer cast, will automatically decrease
/******** Or a more straight forward approach ********/
int L = (int)(((int)(u*p))/(s-q+1) + 1)

// Subtract rounding-off error
if(L - 128 <= 1 ) L = 2;        // N cannot be below 2
else L -= 128;

This is your minimum answer (lower bound).

Note: The reason of integer casting is to reduce our sample space. It can be omitted if you feel so.

Elimination of possible numbers and getting the right one

for (int i = L; i <= R; ++i){
    if ((s > q + u*p/(i-1))) break;   // answer would be i
}
N = i;    // least number which satisfies the condition

You can do it even faster with binary search if the gap between bounds (R-L) is large. As for number-range whose difference is 2^n can be reduced in only n steps.

// we know that
// lower limit = L;
// upper limit = R;
// Declare u, p, q, s in global space or pass as parameters to biranySearch

int binarySearch(int l, int r)
{
    if(l==r) return l;

    if (r > l) {
        int mid = l + (r - l) / 2;

        bool b = (s > q + (p*u)/(mid-1));

        if (b==true){
            // we know that numbers >= mid will all satisfy
            // so our scope reduced to [l, mid]
            return binarySearch(l, mid);
        }
        // If mid doesn't satisfy
        // we know that our element is greater than mid
        return binarySearch(mid+1, r); 
    } 
} 

int main(void) 
{
    // calculate lower bound L and upper bound R here using above methods
    int N = binarySearch(L, R);
    // N might have rounding-off errors, so check for them
    // There might be fluctuation of 128 [-63 to 64] so we will manually check.
    // To be on safe side I will assume fluctuation of 256
    L = N-128 > 2 ? N-128 : 2;
    R = N+128 < 0 ? 2147483647 : N+128;
    for(int i=L; i<=R; ++i){
        if( s > q + u * p / ((float)i - 1)) {
            break;
        }
    }
    cout << i << endl;
}

It is mostly a concept, but it is fast and safe. The only thing is that I have not tested it, but it should work!

I will give this a try I guess, but your comments are confusing... you say, "// we don't round off this because increasing this would cause g to decrease, which we don't want" but you *do* round it off by casting to integer... — Yeshe Tenley, May 21 '20 at 13:52
@YesheTenley By rounding-off I mean nearest integer for example 5.7 becomes 6, while casting to integer will make it 5. Yes some of my comments are confusing I am changing them now! — Ankur Parihar, May 21 '20 at 13:58
@YesheTenley Thanks for pointing out this rounding-stuff, I found a huge mistake. Rounding 4.3 will make it 4, but consciously I wanted it to become 5, so I have removed rounding and instead added 1. Now it's good! The previous mistake was due to copy-pasting same code twice, I forgot to edit comments. — Ankur Parihar, May 21 '20 at 14:09
For s = 1, q = 0, u = 2^30 = 1073741824, p = 1, this code gives a lower bound of 536870912 and an upper bound of 1073741824, but the correct answer is 1073741890 — Eric Postpischil, May 22 '20 at 01:49
@EricPostpischil for these constrains bounds given by my code are ```[2^29+1, 2^30+3] => [536870913, 1073741827]```. And the correct answer is 2^30+2 => 1073741826 which is lesser than your answer, lies within bounds and satisfies the inequality. Please check again! — Ankur Parihar, May 22 '20 at 04:58
@AnkurParihar: Note that in the problem statement (edited by me based on comments by OP), the subexpression `N-1` is converted to `float`, and then all arithmetic is performed in `float`, using the IEEE-754 binary32 format. With `N` = 1073741826, `N-1` is 1073741825. The binary32 numbers nearest that are 1073741824 and 1073741952. Commonly, round-to-nearest is used, so 1073741824 is the result of the conversion to `float`. Then `u*p` is 1073741824, so `u*p/(N-1)` is 1, and we have `1 > 0 + 1`, which is false. — Eric Postpischil, May 22 '20 at 10:47
@EricPostpischil I see, there might be a problem for larger numbers. I did some experiment here: https://www.h-schmidt.net/FloatConverter/IEEE754.html and found that for values near 2^31 we may have error of around 128. So I will add this to upper bound, and subtract from lower bound. As Binary-search still would take around only 32 iterations. It should be fast and safe. — Ankur Parihar, May 22 '20 at 11:59
The binary search is neat, but it does not answer the question... i'm going to check Eric's answer below as it might actually have theoretically correct upper/lower bounds — Yeshe Tenley, May 22 '20 at 16:53

score 0 · Answer 2 · answered May 22 '20 at 02:54

Here is the start of a solution. Some caveats:

It is in C, not C++.
It assumes IEEE-754 arithmetic with rounding to nearest.
It does not handle cases where the inequality requires N to go out of the bounds from 2 to INT_MAX.
I have not tested it much.

The code first using floating-point arithmetic to estimate where the boundary where the inequality changes is, neglecting rounding errors. It tests the inequality to see whether it needs to increase or decrease the candidate value. Then it iterates through consecutive integer float values to find the boundary. My sense is this will take few iterations, but I have not analyzed it completely.

This produces the least float with an integer value that satisfies the inequality when used in place of the denominator N-1. The code then finds the least int N such that N-1 rounds to that float, and that should be the N that is the least int for which the inequality is satisfied.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>


//  Test the inequality.
static int Test(float s, float q, float u, float p, int N)
{
    return s > q + (float) (((float) (u * p)) / (N-1));
}


int main(void)
{
    float s = 1;
    float q = 0;
    float u = 0x1p30, p = 1;

    /*  Approximate the desired denominator (N-1) -- would be exact with real
        arithmetic but is subject to rounding errors.
    */
    float D = floorf(u*p/(s-q));

    //  Test which side of the boundary where the inequality changes we are on.
    if (Test(s, q, u, p, (int) D + 1))
    {
        //  We are above the boundary, decrement find the boundary.
        float NextD = D;
        do
        {
            D = NextD;
            //  Decrement D by the greater of 1 or 1 ULP.
            NextD = fminf(D-1, nexttowardf(D, 0));
        }
        while (Test(s, q, u, p, (int) NextD + 1));
    }
    else
        //  We are below the boundary, increment to find the boundary.
        do
            //  Increment D by the greater of 1 or 1 ULP.
            D = fmaxf(D+1, nexttowardf(D, INFINITY));
        while (!Test(s, q, u, p, (int) D + 1));

    //  Find the distance to the next lower float, as an integer.
    int distance = D - nexttowardf(D, 0);

    /*  Find the least integer that rounds to D.  If the distance to the next
        lower float is less than 1, then D is that integer.  Otherwise, we want
        either the midpoint between the D and the next lower float or one more
        than that, depending on whether the low bit of D in the float
        significand is even (midpoint will round to it, so use midpoint) or odd
        (midpoint will not round to it, so use one higher).

        (int) D - distance/2 is the midpoint.

        ((int) D / distance) & 1 scales D to bring the low bit of its
        significand to the one’s position and tests it, producing 0 if it is
        even and 1 if it is odd.
    */
    int I = distance == 0 ? (int) D
        : (int) D - distance/2 + (((int) D / distance) & 1);

    //  Set N to one more than that integer.
    int N = I+1;

    printf("N = %d.\n", N);

    if (Test(s, q, u, p, N-1) || !Test(s, q, u, p, N))
    {
        fprintf(stderr, "Error, solution is wrong.\n");
        exit(EXIT_FAILURE);
    }
}

Find smallest integer that satisfies floating point inequality equation

2 Answers2