1

I have the following functions for computing the square root of a number using the Newton-Raphson method.

double get_dx(double y, double x)
{
   return (x*x - y)/(2*x);
}

double my_sqrt(double y, double initial)
{
   double tolerance = 1.0E-6;
   double x = initial;
   double dx;
   int iteration_count = 0;
   while ( iteration_count < 100 &&
           fabs(dx = get_dx(y, x)) > tolerance )
   {
      x -= dx;
      ++iteration_count;
      if ( iteration_count > 90 )
      {
         printf("Iteration number: %d, dx: %lf\n", iteration_count, dx);
      }
   }

   if ( iteration_count < 100 )
   {
      printf("Got the result to converge in %d iterations.\n", iteration_count);
   }
   else
   {
      printf("Could not get the result to converge.\n");
   }

   return x;
}

They work for most numbers. However, they don't converge when y is 1.0E21 and 1.0E23. There might be other numbers, that I don't know of yet, for which the functions don't converge.

I tested with initial values of:

  1. y*0.5 -- something to get started with.
  2. 1.0E10 -- a number in the neighborhood of the final result.
  3. sqrt(y)+1.0 -- A very close number to the final answer, obviously.

In all the cases the functions failed to converge for 1.0E21 and 1.0E23. I tried a lower number, 1.0E19, and a higher number, 1.0E25. Neither of them is a problem.

Here's a complete program:

#include <stdio.h>
#include <math.h>

double get_dx(double y, double x)
{
   return (x*x - y)/(2*x);
}

double my_sqrt(double y, double initial)
{
   double tolerance = 1.0E-6;
   double x = initial;
   double dx;
   int iteration_count = 0;
   while ( iteration_count < 100 &&
           fabs(dx = get_dx(y, x)) > tolerance )
   {
      x -= dx;
      ++iteration_count;
      if ( iteration_count > 90 )
      {
         printf("Iteration number: %d, dx: %lf\n", iteration_count, dx);
      }
   }

   if ( iteration_count < 100 )
   {
      printf("Got the result to converge in %d iterations.\n", iteration_count);
   }
   else
   {
      printf("Could not get the result to converge.\n");
   }

   return x;
}

void test(double y)
{
   double ans = my_sqrt(y, 0.5*y);
   printf("sqrt of %lg: %lg\n\n", y, ans);

   ans = my_sqrt(y, 1.0E10);
   printf("sqrt of %lg: %lg\n\n", y, ans);

   ans = my_sqrt(y, sqrt(y) + 1.0);
   printf("sqrt of %lg: %lg\n\n", y, ans);
}

int main()
{
   test(1.0E21);
   test(1.0E23);
   test(1.0E19);
   test(1.0E25);
}

and here's the output (built on 64 bit Linux, using gcc 4.8.4):

Iteration number: 91, dx: -0.000002
Iteration number: 92, dx: 0.000002
Iteration number: 93, dx: -0.000002
Iteration number: 94, dx: 0.000002
Iteration number: 95, dx: -0.000002
Iteration number: 96, dx: 0.000002
Iteration number: 97, dx: -0.000002
Iteration number: 98, dx: 0.000002
Iteration number: 99, dx: -0.000002
Iteration number: 100, dx: 0.000002
Could not get the result to converge.
sqrt of 1e+21: 3.16228e+10

Iteration number: 91, dx: 0.000002
Iteration number: 92, dx: -0.000002
Iteration number: 93, dx: 0.000002
Iteration number: 94, dx: -0.000002
Iteration number: 95, dx: 0.000002
Iteration number: 96, dx: -0.000002
Iteration number: 97, dx: 0.000002
Iteration number: 98, dx: -0.000002
Iteration number: 99, dx: 0.000002
Iteration number: 100, dx: -0.000002
Could not get the result to converge.
sqrt of 1e+21: 3.16228e+10

Iteration number: 91, dx: 0.000002
Iteration number: 92, dx: -0.000002
Iteration number: 93, dx: 0.000002
Iteration number: 94, dx: -0.000002
Iteration number: 95, dx: 0.000002
Iteration number: 96, dx: -0.000002
Iteration number: 97, dx: 0.000002
Iteration number: 98, dx: -0.000002
Iteration number: 99, dx: 0.000002
Iteration number: 100, dx: -0.000002
Could not get the result to converge.
sqrt of 1e+21: 3.16228e+10

Iteration number: 91, dx: 0.000027
Iteration number: 92, dx: 0.000027
Iteration number: 93, dx: 0.000027
Iteration number: 94, dx: 0.000027
Iteration number: 95, dx: 0.000027
Iteration number: 96, dx: 0.000027
Iteration number: 97, dx: 0.000027
Iteration number: 98, dx: 0.000027
Iteration number: 99, dx: 0.000027
Iteration number: 100, dx: 0.000027
Could not get the result to converge.
sqrt of 1e+23: 3.16228e+11

Iteration number: 91, dx: -0.000027
Iteration number: 92, dx: -0.000027
Iteration number: 93, dx: -0.000027
Iteration number: 94, dx: -0.000027
Iteration number: 95, dx: -0.000027
Iteration number: 96, dx: -0.000027
Iteration number: 97, dx: -0.000027
Iteration number: 98, dx: -0.000027
Iteration number: 99, dx: -0.000027
Iteration number: 100, dx: -0.000027
Could not get the result to converge.
sqrt of 1e+23: 3.16228e+11

Iteration number: 91, dx: 0.000027
Iteration number: 92, dx: 0.000027
Iteration number: 93, dx: 0.000027
Iteration number: 94, dx: 0.000027
Iteration number: 95, dx: 0.000027
Iteration number: 96, dx: 0.000027
Iteration number: 97, dx: 0.000027
Iteration number: 98, dx: 0.000027
Iteration number: 99, dx: 0.000027
Iteration number: 100, dx: 0.000027
Could not get the result to converge.
sqrt of 1e+23: 3.16228e+11

Got the result to converge in 35 iterations.
sqrt of 1e+19: 3.16228e+09

Got the result to converge in 6 iterations.
sqrt of 1e+19: 3.16228e+09

Got the result to converge in 1 iterations.
sqrt of 1e+19: 3.16228e+09

Got the result to converge in 45 iterations.
sqrt of 1e+25: 3.16228e+12

Got the result to converge in 13 iterations.
sqrt of 1e+25: 3.16228e+12

Got the result to converge in 1 iterations.
sqrt of 1e+25: 3.16228e+12

Can anyone explain why the above functions don't converge for 1.0E21 and 1.0E23?

R Sahu
  • 204,454
  • 14
  • 159
  • 270
  • 2
    The value you seek is `3.2e10` with a tolerance of `1e-6`. Those two numbers are 16 orders of magnitude apart, which stretches the limits of precision for a `double`. I assume you just got lucky with `1e25`. – user3386109 May 09 '16 at 05:23
  • This is a floating point accuracy problem. You need to fix your `tolerance` so that it's relative rather than absolute. For instance, if your target number is `1e11`, then the tolerance is effectively zero since the mantissa isn't large enough to accommodate a delta of `1e-6`. In other words, `1e11 + 1e6 == 1e11`. – Tom Karzes May 09 '16 at 05:25
  • @user3386109, Just for kicks, I tried it with 1.0E27 too and the functions converge for it. Strange. – R Sahu May 09 '16 at 05:29
  • @TomKarzes, see my response to the comment by user3386109. – R Sahu May 09 '16 at 05:29
  • It looks like with 1E23 and 1E21, you have hit a corner case is IEEE754 Double Precision Floating Point where you are bounding between two mirrored cases of inexactness when trying to converge. I've tried on gcc 6.1.1 same results, so there are no intervening fixes. I may have more time tomorrow to look further. – David C. Rankin May 09 '16 at 05:39
  • Sometimes the difference will be "zero". It may in fact be a good deal larger than that, but the difference will still be zero due to floating point roundoff. – Tom Karzes May 09 '16 at 05:41
  • @TomKarzes, I think you are on the right path. If you could formulate an answer with a bit more detail, that will be awesome. – R Sahu May 09 '16 at 05:43
  • Have you experimented with the square root of very small numbers, such as 2.3754E-23? Do you get meaningful results with those? – Jonathan Leffler May 09 '16 at 06:09
  • @JonathanLeffler, No, I haven't. I haven't thought of a good way to adjust `tolerance` for such small values of `y`. – R Sahu May 09 '16 at 06:16
  • I found two example in my bookshelf there may be others if I look in the right places). (1) Kernighan and Plauger, "The Elements of Programming Style", chapter 1, "Points to Ponder 1.3" — `double reldiff(double x, double y) { return fabs(x - y) / fmax(fabs(x), fabs(y)); }` (on the spot translation of Fortran — the context was a square root function). (2) Bentley "More Programming Pearls", Column 14, pseudo-code to find square root of `sum`: `eps := 1.0E-7; Z := initial_guess; loop newZ := 0.5 * (Z + Sum / Z); if abs(NewZ - Z) <= Eps * NewZ then break; Z := NewZ; end loop; return NewZ` – Jonathan Leffler May 09 '16 at 06:57
  • Note that in the context of 'More Programming Pearls', the term 'sum' made sense, and the initial guess was derived while `sum` was calculated. The calculation was actually the square root of the sum of the squares of a number of differences (`sqrt((x1-x0)*(x1-x0) + (y1-y0)*(y1-y0) + (z1-z0)*(z1-z0))`), and the initial guess was the absolute value of the biggest of the differences `(x1-x0)`, `(y1-y0)`, `(z1-z0)` — it makes sense when you read the book! – Jonathan Leffler May 09 '16 at 07:04
  • @JonathanLeffler, Thanks for the info. I don't have either of the books at my disposal. I'll see what I can do :) – R Sahu May 09 '16 at 13:59
  • "Elements" is a marvellous book, but also rather old (I have the 1978 second edition). It uses Fortran and PL/1 for its examples — which wasn't as idiosyncratic then as it would be considered now; when the first edition (1974) was published, C had not yet been published — and things like Java, C#, C++ and so on weren't even a gleam in the eye. It's actually readable even if you know neither Fortran or PL/1, but if you have access to a (very) good library, it might be more sensible to get it from there. Both of Bentley's "Pearls" books are, indeed, pearls — valuable to have around. – Jonathan Leffler May 09 '16 at 14:07

2 Answers2

6

This answer specifically explains why the algorithm fails to converge for an input of 1e23.

The issue that you're facing is known as the "small difference of large numbers". Specifically, you're computing (x*x - y)/(2*x) where y is 1e23, and x is approximately 3.16e11.

The IEEE-754 encoding for 1e23 is 0x44b52d02c7e14af6. Hence the exponent is 0x44b, which is 1099 decimal. That has to be reduced by 1023, since the exponent is encoded as offset-binary. Then you have to subtract another 52 to find the weight of an LSB.

0x44b ==>    1099 - 1023 - 52 = 24    ==> 1 LSB = 2^24

So a single LSB has the value 16777216.

The result from this code

double y = 1e23;
double x = sqrt(y);
double dx = x*x - y;
printf( "%lf\n", dx );

is in fact -16777216.

As a result, when you compute x*x - y, the result is either going to be zero, or it's going to be a multiple of 16777216. Dividing 16777216 by 2*x which is 2*3.16e11 gives an error of 0.000027, which is not within your tolerance.

The two closest values to the sqrt(y) are

0x4252682931c035a0
0x4252682931c035a1

and when you square those numbers you get

0x44b52d02c7e14af5
0x44b52d02c7e14af7

So neither one matches the desired result, which is

0x44b52d02c7e14af6

and hence the algorithm can never converge.


The reason the algorithm works for 1e25 is due to luck. The encoding for 1e25 is

0x45208b2a2c280291

The encoding for the sqrt(1e25) is

0x428702337e304309

and when you square that number, you get

0x45208b2a2c280291

Hence x*x - y is exactly 0, and the algorithm gets lucky.

user3386109
  • 34,287
  • 7
  • 49
  • 68
2

This is a floating point accuracy problem, in conjunction with the use of a fixed tolerance.

The problem is that, if the magnitude of the numbers being subtracted is large enough, you can get one of two effects: (1) a "false" zero difference, which isn't a problem, or (2) failed convergence due to the difference being persistently greater than the tolerance.

Another problem with using a fixed tolerance of 1e-6 is that it won't work for small numbers. For instance, suppose you're taking the square root of 1e-16. The square root will be 1e-8. The convergence test will incorrectly decide it's found the solution on the very first iteration. In this case, a much smaller tolerance would be needed.

A more sophisticated convergence test would probably make sense. For instance, checking if the current estimate is closer than the previous estimate.

Tom Karzes
  • 22,815
  • 2
  • 22
  • 41