3

I am looking primarily for a theoretical answer, ideally one addressing my misunderstanding (below) about binary search algorithms. Practical examples are welcome as well.

I have a complex function where I want to do a binary search on that function to find the value of n, so that computed_goal matches a given goal, where goal is a previously-given value I am seeking to match. The algorithm below illustrates this.

l = 0, h = 4000; // low, and high bounds for binary search
while (l < h)
{
    n = (l + h) / 2;

    x = vars / n; // vars represents a more complex computation, x depends on n
        
    computed_goal = ax^4 + bx^3 + cx^2 + dx + c; // computed_goal depends on n
        
    if (computed_goal < goal)
        l = n + 1;
    else
        h = n - 1;

    if (abs(computed_goal - goal) / goal < 0.01) // check for relative error within 1%
        return n; // successfully matched goal value!
    else
        return 0; // iteration failed to match goal to desired precision
}

Now, I think you may already be seeing a problem if you are familiar with binary search algorithm, but I'm not super-verse on them. My problem does not always present itself but sometimes it is as follows:

Say my function gives me the following values for these n:

n = 2998: computed_goal = 967.8732568 (-2.1267432 = computed_goal - goal)
n = 2999: computed_goal = 968.6512540 (-1.3487460)
n = 3000: computed_goal = 969.4294677 (-0.5705323)
n = 3001: computed_goal = 970.2078977 (+0.2078977) <= This is the optimal solution
n = 3002: computed_goal = 970.9865442 (+0.9865442)

Now, suppose l = 2987, and h = 3015. This makes n = 3001. My goal is to match value of 970. Note that n = 3001 is the optimal solution because |computed_goal - goal| is the smallest in magnitude and it is the closest to the goal

However, my algorithm looks at it and goes computed_goal = 970.2078977 < goal = 970 = False, and sets new h to be n - 1 = 3000, thus excluding the optimal solution from any future iterations. It so happens that my algorithm finally incorrectly settles in on n = 2999, two steps away from optimal solution, here's a listing for reference:

   l    h      n    computed_goal    goal
2987 3015 = 3001, 970.207897725643 < 970
2987 3000 = 2993, 963.9865161108   < 970
2994 3000 = 2997, 967.095475944351 < 970
2998 3000 = 2999, 968.651254012028 < 970

Misunderstanding

I understand that I am checking but then excluding the optimal solution and because I'm not backtracking back to it, it gets lost and excluded...

I was told before that you should (always) exclude the endpoints that you've already checked, because you've already checked them, so it does not make sense to include them in future search bounds. So like in this case I've checked 3001, and in any future iterations I exclude that value from further checks. I am guessing that is incorrect, and I am guessing I must include them(?) Can you help me with theory part of this?

Attempt to fix. I should note as well that when I change my if statement to include end points, like so:

   if (computed_goal < goal)
        l = n;
    else
        h = n;

Algorithm keeps executing indefinitely in an infinite look with l = 3000 and h = 3001. I am not sure how to go around that. I could change my while to while(l + 1 < h), which will break the infinite loop, but it looks off to me, and I am concerned that I may be introducing a different error into the code, and I don't have enough insight to say if I should be using the l + 1 < h loop condition.

Function Sketch

enter image description here

Code

function matchGoal(int $goal = 970): int
{
    $l = 0;
    $h = 3900;

    while ($l < $h)
    {
        $n = intdiv($h + $l, 2);
        $x = 8321100 / $n;
        $factor = 5.113233695 * pow($n / 3000, 2);

        $computed_goal = (((-8.16965E-10 * $x + -3.15457E-06) * $x + 0.008437163) * $x + 207.9079475) * $factor - 0.076934117;

        if ($computed_goal < $goal)
            $l = $n + 1;
        else
            $h = $n - 1;
    }

    $relative_error = 100 * abs($computed_goal - $goal) / $goal;
    return ($relative_error < 1) ? $n : 0;
}

// outputs 2999, when 3001 is optimal
print matchGoal() . PHP_EOL; 
Dennis
  • 7,907
  • 11
  • 65
  • 115
  • I believe it is monotonic at least as far as that interval is concerned. Let me see if I can do a sketch.. it might take me some time.. – Dennis Jun 15 '23 at 22:05
  • 2
    Well even `computed_goal` is monotonic, binary search inherently does not try to minimize `abs(computed_goal - goal)`, it can try to find the highest `computed_goal` that is less than `goal` or the lowest `computed_goal` that is higher than `goal` (depending on some details) but neither of those are the same thing as minimizing the distance to the goal – harold Jun 15 '23 at 22:16
  • I've added a function & formula sketch. And hmmm interesting, I did not think of binary search that way, thank you. I think for my specific function I can do the binary search to the lowest value higher than the `goal`, and then do a lookup on `n` and `n-1`, which should give me the value of n that will give the minimum abs distance to the goal. I am curious however if there are better algorithms that can be used for finding the minimum `abs` distance in my case. But with binary search, can I use `while (l + 1 < h)` in my while loop without ill effects? Is there a case where it may fail? – Dennis Jun 15 '23 at 22:46
  • 1
    There is [ternary search](https://en.wikipedia.org/wiki/Ternary_search), to find the optimum of a unimodal function (if the goal function is monotonic, the distance between it and the goal should be unimodal) but I think that's actually worse than doing a binary search and then checking the neighbour. IDK about `while (l + 1 < h)` but you shouldn't use that attempted fix to begin with (it was trying to compensate for something that cannot be compensated for) – harold Jun 15 '23 at 22:55
  • Binary search is simple in theory, but it only works in practice if you get the details right. The code you've shown doesn't have the details needed to answer your question. You need to post real code that demonstrates the problem. See [mcve]. – user3386109 Jun 16 '23 at 00:10
  • The code you posted is just buggy. It only ever does one iteration. – Matt Timmermans Jun 17 '23 at 04:28
  • @user3386109 I've posted the code at the bottom of my question. Matt, I've posted code using PHP – Dennis Jun 19 '23 at 17:47
  • 1
    The search reaches the point where `l=2998 h=3000 and n=2999`. The computed goal is less than the actual goal, so `l` is changed to 3000 and the loop ends. Which means that *after* the loop `l=3000 h=3000 and n=2999`. Then the code returns `n`, and that's the problem. The result of the binary search was 3000, and that's the value that should have been returned. – user3386109 Jun 19 '23 at 19:18
  • 1
    Gassa's answer already explains how to fix the code. What you want is a search that ends when `l` is equal to `h-1`. Then `f(l)` will be less than the goal, and `f(h)` will be greater than the goal, and the code can decide which is better. To fix the code, change the loop to `while (l < h-1)`, and at the end of the loop, `if (computed_goal < goal) l=n else h=n`. – user3386109 Jun 19 '23 at 19:28

3 Answers3

3

The binary search would like a monotonic binary predicate to work.
A binary predicate is a function with two possible values: true and false.
A monotonic binary predicate is true up to some point and false after, or vice versa.

Suppose the problem is "find the X such that value f(X) is closest to Y".
Further, suppose f(X) is increasing.

First, we reformulate the problem as one with a binary predicate, for example, "find the smallest X such that f(X) >= Y". Or "find the largest X such that f(X) < Y". Now, these are problems solvable by binary search. With the intent of, after solving this simpler problem, return to the original problem in some simple final step: "check X and X-1" or "check X and X+1", respectively, and pick the best of the two.

Next, I'd like to show an asymmetric version of binary search, instead of one excluding endpoints from both sides. Binary search doesn't have to work this way, but this may show the method from a different point of view than you already have, which may help understanding.

Let us solve the "find the largest X such that f(X) < Y" version using binary search. Formally, the value of the binary predicate pred(X) is true if f(X) < Y and false if f(X) >= Y. Now, it would be convenient to maintain a semi-interval [lo, hi), such that pred(lo) = true and pred(hi) = false. Initially, lo is the inclusive initial lower bound, and hi is the exclusive initial upper bound. Check the middle value me = (lo + hi) / 2; round to either side, this version will work either way. If pred(me) = true, then [me, hi) is the new interval. If pred(me) = false, then [lo, me) is the new interval. Proceed until lo + 1 = hi: this means the only point in [lo, hi) is lo, and it is the answer (to our binary search subproblem, not to the original problem).

Gassa
  • 8,546
  • 3
  • 29
  • 49
  • thank you. I've seen in some binary searches, including Jean's answer where he quotes Wiki's binary search using `L := m + 1`, an assignment that is also present in my original code, where I have `$l = $n + 1`. But, it is not suitable for my problem, because the new interval will be `[me + 1, hi)`, and we have no guarantee that pred(me + 1) is true. (It is false when `m + 1 = h`). In other words, to maintain the predicate, we must use `lo := m` assignment. Why can some implementations afford having `+1` and some do not? Is it the matter of predicates being defined differently? – Dennis Jun 19 '23 at 21:48
  • 1
    @Dennis I'd say that we get exclusive intervals when we have a three-way comparison at the base of the search: we compare some two values, a and b, and have three possible outcomes, "a < b", "a = b", and "a > b". Some problems can be formulated in such terms, some can't. Personally, I prefer two-way comparison ("less" or "not less") because it's simpler, and simpler solution means less bugs, on average, in its implementation. For a similar reason, I prefer semi-intervals: they provide a solution without the dreaded +-1 at all, and, again, it's simpler to reason about, and harder to make a bug. – Gassa Jun 19 '23 at 22:10
1

I was told before that you should (always) exclude the endpoints that you've already checked, because you've already checked them, so it does not make sense to include them in future search bounds. So like in this case I've checked 3001, and in any future iterations I exclude that value from further checks. I am guessing that is incorrect, and I am guessing I must include them(?) Can you help me with theory part of this?

That would only be correct if "checking" an "endpoint" means you verify that it's not the right answer.

Think about looking up a word in a real-world dictionary. You can start by opening the book to the center and checking whether the word is above or below that point in the dictionary, then discard the half which you know doesn't have your word, bisect the remainder, and repeat the process until you've found the page and column that should have your word. But if you open the dictionary and the word at the top of the page happens to be the word you're looking for, you don't exclude that word: you short-circuit the whole process and say you've won.

Now, what if the word isn't in the dictionary, but your job is to find the word nearest to it? You can't discard a word just because it's not the word you're looking for. You'd probably need to keep track of the closest endpoints you've encountered above and below the word you're searching for, and when you get to a point where there are no words between them then see which of those two words is "closest" to the one you're looking for.

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
1

You may always test for the base case first (a safe way to do it), so:

while (l <= h)
 {
     n = (l + h) / 2;

     x = vars / n; // vars represents a more complex computation, x depends on n
     
     computed_goal = ax^4 + bx^3 + cx^2 + dx + c; // computed_goal depends on n
     
     // base case reached ?
     if (abs(computed_goal - goal) / goal < 0.01)
         return n; // successfully matched goal value!

     // goal not reached, recurse on one of the halves
     if (computed_goal < goal)
         l = n + 1;
     else
         h = n - 1;
 }

You may also use the follwing pattern (from Wikipedia), note the if else if... and the L<=R:

function binary_search(A, n, T) is
    L := 0
    R := n − 1
    while L ≤ R do
        m := floor((L + R) / 2)
        if A[m] < T then
            L := m + 1
        else if A[m] > T then
            R := m − 1
        else:
            return m
    return unsuccessful

Be careful to ensure that your goal is always reached, if not you will end the while and must return something!

Jean-Baptiste Yunès
  • 34,548
  • 4
  • 48
  • 69