
The following complete program compares the speed of the fast inverse square root with 1/sqrt(). According to this sentence on Wikipedia: "The algorithm was approximately four times faster than computing the square root with another method and calculating the reciprocal via floating point division."

But here is why I am here: in my test it is slower than 1/sqrt(). Is something wrong in my code?

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 1.0e+100;

      int N = 100000000;
      int i = 0;

      clock_t start2 = clock (); 
      do  
        {   
          float z = 1.0 / sqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end2 = clock (); 

      double time2 = (end2 - start2) / (double) CLOCKS_PER_SEC;

      printf ("1/sqrt() spends %13f sec.\n\n", time2);

      i = 0;
      clock_t start1 = clock (); 
      do  
        {   
          float y = FastInvSqrt (x);
          i++;
        }   
      while (i < N); 
      clock_t end1 = clock (); 

      double time1 = (end1 - start1) / (double) CLOCKS_PER_SEC;



      printf ("FastInvSqrt() spends %f sec.\n\n", time1);


      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", time2/time1);

      return 0;
}

float
FastInvSqrt (float x)
{
  float xhalf = 0.5F * x;
  int i = *(int *) &x;  // store floating-point bits in integer
  i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
  x = *(float *) &i;            // convert new bits into float
  x = x * (1.5 - xhalf * x * x);        // One round of Newton's method
  //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
  //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
  //x = x * (1.5 - xhalf * x * x);      // One round of Newton's method
  return x;
}

The result is as follows:

1/sqrt() spends      0.850000 sec.

FastInvSqrt() spends 0.960000 sec.

fast inverse square root is faster 0.885417 times than 1/sqrt().
phuclv
San Tseng
    On which value(s) are you testing your function? There could be an initial constant penalty in just running the code with, e.g., an input of 1. So maybe try some very large values, and do your benchmarking on that. – Tim Biegeleisen Jul 24 '18 at 01:42
  • I use float x = 10.0; to test the two methods. – San Tseng Jul 24 '18 at 01:50
  • Well, use much larger values then. – Tim Biegeleisen Jul 24 '18 at 01:51
  • if I use float x = 1.0e+100; 1/sqrt() spends 0.810000 sec. FastInvSqrt() spends 0.880000 sec. fast inverse square root is faster 0.920455 times than 1/sqrt(). – San Tseng Jul 24 '18 at 01:52
  • so 'Fast inverse square root' is slower than 1/sqrt(), if I use float x = 1.0e+100; – San Tseng Jul 24 '18 at 01:54
  • Your code is garbage because the loop contents don't use the results of the square root algorithms. Recode so that the loops calculate a different reciprocal square root every time and print out the sum of the results, and enable optimization. Also your inverse square root takes at least 3 iterations to converge to single-precision accuracy and should be computed perhaps 8 abreast so that the compiler can vectorize the operations with ymm registers. – user5713492 Jul 24 '18 at 01:57
  • @user5713492 that only takes effect if you compile with optimizations, in which case what you coded is no longer what's being executed anyway; you'd have to compare the assembly. For the record, I modified OP's code to increment `X` and reset it to 10 between tests, and got the following: `1/sqrt() spends 0.105822 sec. FastInvSqrt() spends 0.000001 sec.`, so I think it would be necessary to generate a random array of values to test ahead of time. – Dillon Davis Jul 24 '18 at 02:03
  • @user5713492 I only use 1 iteration; please look at it carefully. – San Tseng Jul 24 '18 at 02:08
  • @San Tseng The reference for Wikipedia's claim that `FastInvSqrt` is four times faster than a particular alternative is a paper that was published in 2003, fifteen years ago. There is no particular reason to assume that what was true for the hardware platforms, toolchains, and libraries in 2003 still applies today, as all of those have changed substantially since then (e.g. the hardware instructions for square root on x86 processors have become *much* faster since then). – njuffa Jul 24 '18 at 02:13
  • @DillonDavis Timing tests with optimization disabled are futile. The tests **must** calculate `N` different reciprocal square roots and **must** use all of the results in output, otherwise the optimizer can just skip the loops. To the O.P.: your loops as presented execute 100000001 iterations, not just 1. If it took about a second to compute just 1 reciprocal square root, well, that would be just pitiful and it's not what is happening. – user5713492 Jul 24 '18 at 02:14
  • @njuffa processors can crunch out 2X 8-wide FMACs per clock cycle per core at single precision nowadays, as well, Norbert. – user5713492 Jul 24 '18 at 02:16
  • @user5713492 tests comparing algorithms with optimization disabled are futile, but a test of bitwise and pointer hacks with optimizations enabled is worthless. Your compiler is going to transform what you wrote with those same micro-optimizations you're trying to test, so there's no way to determine what operations / tricks in the source code really worked without diving into assembly. – Dillon Davis Jul 24 '18 at 02:19
  • @user5713492: San Tseng is trying to say Newton's method was only applied for one iteration. @San Tseng: user5713492 is trying to say you need three iterations to get an accurate result. – Dillon Davis Jul 24 '18 at 02:21
  • Sorry for the delay in replying. I mean I only use 1 iteration in Newton's root-finding method. – San Tseng Jul 24 '18 at 02:28
  • @user5713492 The floating-point units as a whole have gotten wider, but that applies to square roots and multiplies alike. Last I checked, the ratio of throughput between square roots and add/mul/fma had gotten significantly smaller than it was fifteen years ago, from 1:25 down to 1:6 (or thereabouts). – njuffa Jul 24 '18 at 02:30

2 Answers


A function that only has to be precise over a reduced domain can be computed with less work (meaning it can be computed faster). Think of this as optimizing the computation of a function's shape for a subset of its definition, much as each search algorithm is best for a particular kind of input (No Free Lunch theorem).

As such, using this function for inputs outside the interval [0, 1] (for which I suppose it was optimized/designed) means using it on the subset of inputs where its cost is worse (higher) than that of other, possibly specialized, square-root routines.

The library sqrt() you are benchmarking against was itself (likely) also optimized: it uses precomputed values in a sort of LUT, which act as initial guesses for further approximation. Such a more "general" function (one that covers more of the domain and tries to be efficient through precomputation, eliminating redundant computation, or maximizing data reuse at run time) has its own complexity limitations: the more precomputed choices there are to select between per interval, the more decision overhead there is at run time. Knowing at compile time that all your inputs to sqrt() lie in the interval [0, 1] would reduce that run-time decision overhead, as you would know ahead of time which specialized approximation function to use (or you could generate specialized functions for each interval of interest at compile time; see meta-programming for this).

Efin Fon

I corrected my code as follows: 1. compute a random number instead of a fixed one; 2. measure the time inside the while loop and sum it.

    #include <stdio.h>
    #include <time.h>
    #include <math.h>
    #include <stdlib.h>

    float FastInvSqrt (float number);

    int
    main ()
    {
      float x = 0;
      time_t t;

      srand ((unsigned) time (&t));

      int N = 1000000;
      int i = 0;

      double sum_time2 = 0.0;

      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start2 = clock ();
          float z = 1.0 / sqrt (x);
          clock_t end2 = clock ();
          sum_time2 = sum_time2 + (end2 - start2);
          i++;
        }
      while (i < N);

      printf ("1/sqrt() spends %13f sec.\n\n", sum_time2 / (double) CLOCKS_PER_SEC);

      double sum_time1 = 0.0;

      i = 0;
      do
        {
          x = (float) (rand () % 10000) * 0.22158;
          clock_t start1 = clock ();
          float y = FastInvSqrt (x);
          clock_t end1 = clock ();
          sum_time1 = sum_time1 + (end1 - start1);
          i++;
        }
      while (i < N);

      printf ("FastInvSqrt() spends %f sec.\n\n", sum_time1 / (double) CLOCKS_PER_SEC);

      printf ("fast inverse square root is faster %f times than 1/sqrt().\n", sum_time2 / sum_time1);

      return 0;
    }

    float
    FastInvSqrt (float x)
    {
      float xhalf = 0.5F * x;
      int i = *(int *) &x;          // store floating-point bits in integer
      i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
      x = *(float *) &i;            // convert new bits into float
      x = x * (1.5 - xhalf * x * x);        // one round of Newton's method
      //x = x * (1.5 - xhalf * x * x);      // further rounds (disabled)
      return x;
    }

but the fast inverse square root is still slower than 1/sqrt().

1/sqrt() spends      0.530000 sec.

FastInvSqrt() spends 0.540000 sec.

fast inverse square root is faster 0.981481 times than 1/sqrt().
San Tseng