Fastest way to get square root in float value

Question

I am trying to find a fastest way to make square root of any float number in C++. I am using this type of function in a huge particles movement calculation like calculation distance between two particle, we need a square root etc. So If any suggestion it will be very helpful. I have tried and below is my code

#include <math.h>
#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

#define CHECK_RANGE 100

inline float msqrt(float a)
{
    int i;
    for (i = 0;i * i <= a;i++);
    
    float lb = i - 1; //lower bound
    if (lb * lb == a)
        return lb;
    float ub = lb + 1; // upper bound
    float pub = ub; // previous upper bound
    for (int j = 0;j <= 20;j++)
    {
        float ub2 = ub * ub;
        if (ub2 > a)
        {
            pub = ub;
            ub = (lb + ub) / 2; // mid value of lower and upper bound
        }
        else
        {
            lb = ub; 
            ub = pub;
        }
    }
    return ub;
}

void check_msqrt()
{
    for (size_t i = 0; i < CHECK_RANGE; i++)
    {
        msqrt(i);
    }
}

void check_sqrt()
{
    for (size_t i = 0; i < CHECK_RANGE; i++)
    {
        sqrt(i);
    }
}

int main()
{
    auto start1 = high_resolution_clock::now();
    check_msqrt();
    auto stop1 = high_resolution_clock::now();

    auto duration1 = duration_cast<microseconds>(stop1 - start1);
    cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";


    auto start2 = high_resolution_clock::now();
    check_sqrt();
    auto stop2 = high_resolution_clock::now();

    auto duration2 = duration_cast<microseconds>(stop2 - start2);
    cout << "Time for check_sqrt = " << duration2.count() << " micro secs";
    
    //cout << msqrt(3);

    return 0;
}

output of above code showing the implemented method 4 times more slow than sqrt of math.h file. I need faster than math.h version.

How about https://en.wikipedia.org/wiki/Fast_inverse_square_root? — Luchian Grigore, Mar 03 '22 at 11:57
At least, for Intel x86, I know there is an op-code to compute square root ([FSQRT](https://www.felixcloutier.com/x86/fsqrt)), and it seems the compiler is aware of this as well: [test on Compiler Explorer](https://godbolt.org/z/661nz6Wf9). This is hard to beat in S/W... (OK. I must admit I cheated a bit. I used `long double` to make this illustrative.) ;-) — Scheff's Cat, Mar 03 '22 at 11:58
@Scheff'sCat, I think it may be updated version of math.h, also try to check for this. Thanks — Gautam Jangid, Mar 03 '22 at 13:18
@LuchianGrigore I checked for the inverse sqrt Time for check_msqrt = 4340 micro secs Time for check_sqrt = 308 micro secs Time for check_Qsqrt = 360 micro secs still not getting the desired results, also it has some error in floating point more than mentioned above two methods. In some manner it is little bit faster than version of sqrt of math.h — Gautam Jangid, Mar 03 '22 at 13:36
Your question is not giving useful information. Why do you need to compute these square roots ? Can't you do without them ? What is the accuracy required ? What time performance do you need ? [The computation of the squared distance itself also has a cost, there is no need to make a much faster square root.] — , Mar 03 '22 at 13:48
@YvesDaoust As I mentioned on my question I need the sqrt for calculation of distance between two points. I share a scenario for more details so that you can find more information. Suppose there are huge number of particles (spherical) and the interaction between them depends on the distance between two particles, so for calculating distance we have to use square root formula for distance between two points. As particles are huge in size so we need many square roots to calc it mathematically. — Gautam Jangid, Mar 03 '22 at 14:23
@YvesDaoust at the same time we are also drawing these particles (as spheres) in separate thread so that's why it is going hang the system when the number of particles increase. Hope you understand now why I need it optimized. In the current scenario math.h version making it slow. — Gautam Jangid, Mar 03 '22 at 14:23
@Scheff'sCat fsqrt is x87, not so relevant nowadays. The corresponding SSE instruction is sqrtss . Also possibly relevant are rsqrtss, vrsqrt14ss, vrsqrt28ss, etc. — Marc Glisse, Mar 03 '22 at 14:27
In fact you added very little new/useful information, and nothing of what I asked. How is the dependency on distance ? — , Mar 03 '22 at 14:29
I'm afraid to ask, [how](https://godbolt.org/z/v6rrsbE67) have you compiled this program? Have you enabled [optimizations](https://godbolt.org/z/evKn1W64z)? — Bob__, Mar 03 '22 at 14:51
@LuchianGrigore Thanks for the Wikipedia link , It worked, no need to calculate square root in the case, just using inverse square root directly. — Gautam Jangid, Mar 04 '22 at 03:13

nielsen · Accepted Answer · 2022-03-03T19:59:18.633

3

In short, I do not think it is possible to implement something generally faster than the standard library version of sqrt.

Performance is a very important parameter when implementing standard library functions and it is fair to assume that such a commonly used function as sqrt is optimized as much as possible.

Beating the standard library function would require a special case, such as:

Availability of a suitable assembler instruction - or other specialized hardware support - on the particular system for which the standard library has not been specialized.
Knowledge of the needed range or precision. The standard library function must handle all cases specified by the standard. If the application only needs a subset of that or maybe only requires an approximate result then perhaps an optimization is possible.
Making a mathematical reduction of the calculations or combine some calculation steps in a smart way so an efficient implementation can be made for that combination.

edited Mar 03 '22 at 19:59

answered Mar 03 '22 at 14:13

nielsen

5,641
10
27

This. Per Agner Fog's tables, on modern x86-64 hardware `VSQRT{SS|PS}` computes single-precision square roots with a latency of 12-15 cycles, and throughput is one every 6-7 cycles. With sufficiently high optimization level selected with various compilers, each `sqrtf()` call should map directly to such an instruction (check the disassembly!). In the general case, this performance cannot be beat by discrete replacements. – njuffa Mar 03 '22 at 20:21
Finally I got some little success to optimize the method by only using inverse square root method. In the case we are dividing with distance so it is actually inverse square root that we finally need. Special Thanks to @Luchian Grigore and all other who took time for their precious help. once again thanks to all :) – Gautam Jangid Mar 04 '22 at 03:09
1

@GautamJangid Have you managed to vectorize the code (either by auto vectorization or use of SIMD intrinsics)? That is where the real performance gains are to be had on modern hardware. – njuffa Mar 04 '22 at 04:33
@njuffa no currently using native standard, Thanks for suggestion!, I will try to use auto vectorization and loop unroll to make more optimization :) – Gautam Jangid Mar 04 '22 at 05:29

score 1 · Answer 2 · answered Mar 03 '22 at 19:59

Here's another alternative to binary search. It may not be as fast as std::sqrt, haven't tested it. But it will definitely be faster than your binary search.

auto
Sqrt(float x)
{
    using F = decltype(x);
    if (x == 0 || x == INFINITY || isnan(x))
        return x;
    if (x < 0)
        return F{NAN};
    int e;
    x = std::frexp(x, &e);
    if (e % 2 != 0)
    {
        ++e;
        x /= 2;
    }
    auto y = (F{-160}/567*x + F{2'848}/2'835)*x + F{155}/567;
    y = (y + x/y)/2;
    y = (y + x/y)/2;
    return std::ldexp(y, e/2);
}

After getting +/-0, nan, inf, and negatives out of the way, it works by decomposing the float into a mantissa in the range of [¹/₄, 1) times 2^e where e is an even integer. The answer is then sqrt(mantissa)* 2^{^e/₂}.

Finding the sqrt of the mantissa can be guessed at with a least squares quadratic curve fit in the range [¹/₄, 1]. Then that good guess is refined by two iterations of Newton–Raphson. This will get you within 1 ulp of the correctly rounded result. A good std::sqrt will typically get that last bit correct.

I got your point, Thanks for this great information. – Gautam Jangid Mar 04 '22 at 06:03 — Gautam Jangid, Mar 04 '22 at 06:03

score 0 · Answer 3 · answered Mar 03 '22 at 13:41

I have also tried with the algorithm mention in https://en.wikipedia.org/wiki/Fast_inverse_square_root, but not found desired result, please check

#include <math.h>
#include <iostream>
#include <chrono>

#include <bit>
#include <limits>
#include <cstdint>

using namespace std;
using namespace std::chrono;

#define CHECK_RANGE 10000

inline float msqrt(float a)
{
    int i;
    for (i = 0;i * i <= a;i++);
    
    float lb = i - 1; //lower bound
    if (lb * lb == a)
        return lb;
    float ub = lb + 1; // upper bound
    float pub = ub; // previous upper bound
    for (int j = 0;j <= 20;j++)
    {
        float ub2 = ub * ub;
        if (ub2 > a)
        {
            pub = ub;
            ub = (lb + ub) / 2; // mid value of lower and upper bound
        }
        else
        {
            lb = ub; 
            ub = pub;
        }
    }
    return ub;
}

/* mentioned here ->  https://en.wikipedia.org/wiki/Fast_inverse_square_root */
inline float Q_sqrt(float number)
{
    union Conv {
        float    f;
        uint32_t i;
    };
    Conv conv;
    conv.f= number;
    conv.i = 0x5f3759df - (conv.i >> 1);
    conv.f *= 1.5F - (number * 0.5F * conv.f * conv.f);
    return 1/conv.f;
}

void check_Qsqrt()
{
    for (size_t i = 0; i < CHECK_RANGE; i++)
    {
        Q_sqrt(i);
    }
}

void check_msqrt()
{
    for (size_t i = 0; i < CHECK_RANGE; i++)
    {
        msqrt(i);
    }
}

void check_sqrt()
{
    for (size_t i = 0; i < CHECK_RANGE; i++)
    {
        sqrt(i);
    }
}

int main()
{
    auto start1 = high_resolution_clock::now();
    check_msqrt();
    auto stop1 = high_resolution_clock::now();

    auto duration1 = duration_cast<microseconds>(stop1 - start1);
    cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";


    auto start2 = high_resolution_clock::now();
    check_sqrt();
    auto stop2 = high_resolution_clock::now();

    auto duration2 = duration_cast<microseconds>(stop2 - start2);
    cout << "Time for check_sqrt = " << duration2.count() << " micro secs\n";
    
    auto start3 = high_resolution_clock::now();
    check_Qsqrt();
    auto stop3 = high_resolution_clock::now();

    auto duration3 = duration_cast<microseconds>(stop3 - start3);
    cout << "Time for check_Qsqrt = " << duration3.count() << " micro secs\n";

    //cout << Q_sqrt(3);
    //cout << sqrt(3);
    //cout << msqrt(3);
    return 0;
}

Fastest way to get square root in float value

3 Answers3