4

I have seen floating point bit hacks to produce the square root as seen here fast floating point square root, but this method works for floats.

Is there a similar method for finding the integer square root without loops of a 32-bit unsigned integer? I have been scouring the web for one, but haven't seen any

(my thoughts are that a pure binary representation doesn't have enough information to do it, but since it is constrained to 32-bit I would guess otherwise)

yosmo78
  • 489
  • 4
  • 13

3 Answers3

5

This answer assumes that the target platform does not have floating-point support, or very slow floating-point support (perhaps via emulation).

As has been pointed out in comments, a count leading zeros (CLZ) instruction can be used to provide the fast log2 functionality that is provided via the exponent part of floating-point operands. CLZ can also be emulated with reasonable efficiency on platforms that don't provide the functionality via an intrinsic, as shown below.

An initial approximation good for a few bits can be pulled from a lookup table (LUT), which can be further refined by Newton iterations just like in the floating-point case. One to two iterations will typically be sufficient for a 32-bit integer square root. The ISO-C99 code below shows working exemplary implementation including an exhaustive test.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

uint8_t clz (uint32_t a); // count leading zeros
uint32_t umul_16_16 (uint16_t a, uint16_t b); // 16x16 bit multiply
uint16_t udiv_32_16 (uint32_t x, uint16_t y); // 32/16 bit division

/* LUT for initial square root approximation */
static const uint16_t sqrt_tab[32] = 
{ 
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
    0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff, 
    0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};

/* table lookup for initial guess followed by division-based Newton iteration */
uint16_t my_isqrt (uint32_t x)
{
    uint16_t q, lz, y, i, xh;

    if (x == 0) return x; // early out, code below can't handle zero

    // initial guess based on leading 5 bits of argument normalized to 2.30
    lz = clz (x);
    i = ((x << (lz & ~1)) >> 27);
    y = sqrt_tab[i] >> (lz >> 1);
    xh = x >> 16; // use for overflow check on divisions

    // first Newton iteration, guard against overflow in division
    q = 0xffff;
    if (xh < y) q = udiv_32_16 (x, y);
    y = (q + y) >> 1;

    if (lz < 10) {
        // second Newton iteration, guard against overflow in division
        q = 0xffff;
        if (xh < y) q = udiv_32_16 (x, y);
        y = (q + y) >> 1;
    }

    if (umul_16_16 (y, y) > x) y--; // adjust quotient if too large

    return y; // (uint16_t)sqrt((double)x)
}

static const uint8_t clz_tab[32] = 
{
    31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
    23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
};

/* count leading zeros (for non-zero argument); a machine instruction on many architectures */
uint8_t clz (uint32_t a)
{
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acdd * a >> 27];
}

/* 16x16->32 bit unsigned multiply; machine instruction on many architectures */
uint32_t umul_16_16 (uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}

/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
    uint16_t r = x / y;
    return r;
}

int main (void)
{
    uint32_t x;
    uint16_t res, ref;
    
    printf ("testing 32-bit integer square root\n");
    x = 0;
    do {
        ref = (uint16_t)sqrt((double)x);
        res = my_isqrt (x);
        if (res != ref) {
            printf ("error: x=%08x  res=%08x  ref=%08x\n", x, res, ref);
            printf ("exhaustive test FAILED\n");
            return EXIT_FAILURE;
        }
        x++;
    } while (x);
    printf ("exhaustive test PASSED\n");
    return EXIT_SUCCESS;
}
njuffa
  • 23,970
  • 4
  • 78
  • 130
4

Below are some variants of the code given by @njuffa, including some that are not just loop-free, but also branch-free (the ones that aren't branch free are almost branch free). All of the variants below are derived from the algorithm that's used internally for Python's math.isqrt. The variants are:

  • 32-bit square root: almost branch-free, 1 division
  • 32-bit square root: branch-free, 1 division
  • 32-bit square root: branch-free, 3 divisions, no lookup table
  • 64-bit square root: almost branch-free, 2 divisions
  • 64-bit square root: branch-free, 4 divisions, no lookup table

To finish, we give a very fast division free (and still loop-free, almost branch-free) square root algorithm for unsigned 64-bit inputs. The catch is that it only works for inputs already known to be square; for non-square inputs it will give results that are not useful. It manages to compute each square root in around 3.3 nanoseconds on my 2018 2.7 GHz Intel Core i7 laptop. Scroll all the way down to the bottom of the answer to see it.

32-bit square root: almost branch-free, 1 division

The first variant is the one I'd recommend, all other things being equal. It uses:

  • a single lookup into a 192-byte lookup table
  • a single 32-bit-by-16-bit division with 16-bit result
  • a single 16-bit-by-16-bit multiplication with 32-bit result
  • a count-leading-zeros operation, and a handful of shifts and other cheap operations

After the early return for the case x == 0, it's essentially branch free. (There's a hint of a branch in the final correction step, but the compilers I tried manage to give jump-free code for this.)

Here's the code. Explanations follow.

#include <stdint.h>

// count leading zeros of nonzero 32-bit unsigned integer
int clz32(uint32_t x);

// isqrt32_tab[k] = isqrt(256*(k+64)-1) for 0 <= k < 192
static const uint8_t isqrt32_tab[192] = {
    127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
    137, 138, 139, 140, 141, 142, 143, 143, 144, 145,
    146, 147, 148, 149, 150, 150, 151, 152, 153, 154,
    155, 155, 156, 157, 158, 159, 159, 160, 161, 162,
    163, 163, 164, 165, 166, 167, 167, 168, 169, 170,
    170, 171, 172, 173, 173, 174, 175, 175, 176, 177,
    178, 178, 179, 180, 181, 181, 182, 183, 183, 184,
    185, 185, 186, 187, 187, 188, 189, 189, 190, 191,
    191, 192, 193, 193, 194, 195, 195, 196, 197, 197,
    198, 199, 199, 200, 201, 201, 202, 203, 203, 204,
    204, 205, 206, 206, 207, 207, 208, 209, 209, 210,
    211, 211, 212, 212, 213, 214, 214, 215, 215, 216,
    217, 217, 218, 218, 219, 219, 220, 221, 221, 222,
    222, 223, 223, 224, 225, 225, 226, 226, 227, 227,
    228, 229, 229, 230, 230, 231, 231, 232, 232, 233,
    234, 234, 235, 235, 236, 236, 237, 237, 238, 238,
    239, 239, 240, 241, 241, 242, 242, 243, 243, 244,
    244, 245, 245, 246, 246, 247, 247, 248, 248, 249,
    249, 250, 250, 251, 251, 252, 252, 253, 253, 254,
    254, 255,
};

// integer square root of 32-bit unsigned integer
uint16_t isqrt32(uint32_t x)
{
    if (x == 0) return 0;

    int lz = clz32(x) & 30;
    x <<= lz;
    uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
    y = (y << 7) + (x >> 9) / y;
    y -= x < (uint32_t)y * y;
    return y >> (lz >> 1);
}

Above, we omitted the definition of the function clz32, which can be implemented in exactly the same way as clz in @njuffa's post. With gcc or Clang, you can use something like __builtin_clz in place of clz32. If you have to roll your own, steal it from @njuffa's answer.

Explanation: after handling the special case x = 0, we start by normalising x, shifting by an even integer (effectively multiplying by a power of 4) so that 2**30 <= x < 2**32. We then compute the integer square root y for x, and shift y back to compensate just before returning.

This brings us to the central three lines of code, which are the heart of the algorithm:

    uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
    y = (y << 7) + (x >> 9) / y;
    y -= x < (uint32_t)y * y;

Remarkably, these three lines are all that's needed to compute an integer square root y of a 32-bit integer x under the assumption that 2**30 <= x < 2**32.

The first of the three lines uses the lookup table to retrieve an approximation to the square root of the topmost 16 bits of x, x >> 16. The approximation will always have an error smaller than 1: that is, after the first line is executed, the condition:

(y-1) * (y-1) < (x >> 16) && (x >> 16) < (y+1) * (y+1)

will always be true.

The second line is the most interesting one: before that line is executed, y is an approximation to the square root of x >> 16, with around 7-8 bits of accuracy. It follows that y << 8 is an approximation to the square root of x, again with around 7-8 accurate bits. So a single Newton iteration applied to y << 8 should give a better approximation to x, roughly doubling the number of accurate bits, and that's exactly what the second line computes. The really beautiful part is that if you work through the mathematics, it turns out that you can prove that the resulting y again has error smaller than 1: that is, the condition

(y-1) * (y-1) < x && x < (y+1) * (y+1)

will be true. That means that either y*y <= x and y is already the integer square root of x, or x < y*y and y - 1 is the integer square root of x. So the third line applies that -1 correction to y in the event that x < y*y.

There's one slightly subtle point above. If you follow through the possible sizes of y as the algorithm progresses: after the lookup we have 128 <= y <= 256. After the second line, mathematically we have 32768 <= y <= 65536. But if y = 65536 then we have a problem, since y can no longer be stored in a uint16_t. However, it's possible to prove that that can't happen: roughly, the only way for y to end up being that large is if the result of the lookup is 256, and in that case, it's straightforward to see that the next line can only produce a maximum of 65535.

32-bit square root: branch-free, 1 division

The algorithm above is tantalisingly close to being completely branch free. Here's a variant that's genuinely branch free, at least with the compilers I tried. It uses the same clz32 and lookup table as the previous example.

uint16_t isqrt32(uint32_t x)
{
    int lz = clz32(x | 1) & 30;
    x <<= lz;
    int index = (x >> 24) - 64;
    uint16_t y = 1 + sqrt32_tab[index >= 0 ? index : 0];
    y = (y << 7) + (x >> 9) / y;
    y -= x < (uint32_t)y * y;
    return y >> (lz >> 1);
}

Here we're simply letting the special case x == 0 propagate through the algorithm as usual. We compute clz32(x | 1) in place of clz32(x), partly because clz32(0) may not be well-defined (it isn't for gcc's __builtin_clz, for example), and partly because even if it were well-defined, the resulting shift of 32 would give us undefined behaviour in x <<= lz on the following line. And we have to adjust our lookup to correct for a possibly negative lookup index. (We could also extend the lookup table from 192 entries to 256 entries just to accommodate this case, but that seems wasteful.) Most platforms should have compilers clever enough to turn index >= 0 ? index : 0 into something branch-free. Some architectures even provide a saturating subtraction instruction.

32-bit square root: branch-free, 3 divisions, no lookup table

The next variant needs no lookup table, but uses three divisions instead of one.

uint16_t isqrt32(uint32_t x)
{
    int lz = clz32(x | 1) & 30;
    x <<= lz;
    uint32_t y = 1 + (x >> 30);
    y = (y << 1) + (x >> 27) / y;
    y = (y << 3) + (x >> 21) / y;
    y = (y << 7) + (x >> 9) / y;
    y -= x < (uint32_t)y * y;
    return y >> (lz >> 1);
}

The idea is exactly the same as before, except that we start with smaller approximations.

  • After the initial line uint32_t y = 1 + (x >> 30);, y is an approximation to the square root of x >> 28.
  • After y = (y << 1) + (x >> 27) / y;, y is an approximation to the square root of the eight topmost significant bits of x, x >> 24
  • After y = (y << 3) + (x >> 21) / y;, y is an approximation to the square root of the sixteen topmost significant bits of x, x >> 16.
  • The rest of the algorithm proceeds as before.

64-bit square root: almost branch-free, 2 divisions

The OP asked for an unsigned 32-bit square root, but the technique above works just as well for computing the integer square root of an unsigned 64-bit integer.

Here's some code for that case, which is essentially the same code that Python's math.isqrt uses for inputs smaller than 2**64 (and also to provide the starting guesses for larger inputs).

#include <stdint.h>

// count leading zeros of nonzero 64-bit unsigned integer
int clz64(uint64_t x);

// isqrt64_tab[k] = isqrt(256*(k+65)-1) for 0 <= k < 192
static const uint8_t isqrt64_tab[192] = {
    128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
    138, 139, 140, 141, 142, 143, 143, 144, 145, 146,
    147, 148, 149, 150, 150, 151, 152, 153, 154, 155,
    155, 156, 157, 158, 159, 159, 160, 161, 162, 163,
    163, 164, 165, 166, 167, 167, 168, 169, 170, 170,
    171, 172, 173, 173, 174, 175, 175, 176, 177, 178,
    178, 179, 180, 181, 181, 182, 183, 183, 184, 185,
    185, 186, 187, 187, 188, 189, 189, 190, 191, 191,
    192, 193, 193, 194, 195, 195, 196, 197, 197, 198,
    199, 199, 200, 201, 201, 202, 203, 203, 204, 204,
    205, 206, 206, 207, 207, 208, 209, 209, 210, 211,
    211, 212, 212, 213, 214, 214, 215, 215, 216, 217,
    217, 218, 218, 219, 219, 220, 221, 221, 222, 222,
    223, 223, 224, 225, 225, 226, 226, 227, 227, 228,
    229, 229, 230, 230, 231, 231, 232, 232, 233, 234,
    234, 235, 235, 236, 236, 237, 237, 238, 238, 239,
    239, 240, 241, 241, 242, 242, 243, 243, 244, 244,
    245, 245, 246, 246, 247, 247, 248, 248, 249, 249,
    250, 250, 251, 251, 252, 252, 253, 253, 254, 254,
    255, 255,
};

// integer square root of a 64-bit unsigned integer
uint32_t isqrt64(uint64_t x)
{
    if (x == 0) return 0;

    int lz = clz64(x) & 62;
    x <<= lz;
    uint32_t y = isqrt64_tab[(x >> 56) - 64];
    y = (y << 7) + (x >> 41) / y;
    y = (y << 15) + (x >> 17) / y;
    y -= x < (uint64_t)y * y;
    return y >> (lz >> 1);
}

The above code assumes the existence of a function clz64 for counting leading zeros in a uint64_t value. The function clz64 is permitted to have undefined behaviour in the case x == 0. Again, on gcc and Clang, __builtin_clzll should be usable in place of clz64, assuming that unsigned long long has a width of 64. For completeness, here's an implementation of clz64, based on the usual de Bruijn sequence trickery.

#include <assert.h>

static const uint8_t clz64_tab[64] = {
    63,  5, 62,  4, 16, 10, 61,  3, 24, 15, 36,  9, 30, 21, 60,  2,
    12, 26, 23, 14, 45, 35, 43,  8, 33, 29, 52, 20, 49, 41, 59,  1,
     6, 17, 11, 25, 37, 31, 22, 13, 27, 46, 44, 34, 53, 50, 42,  7,
    18, 38, 32, 28, 47, 54, 51, 19, 39, 48, 55, 40, 56, 57, 58,  0,
};

// count leading zeros of nonzero 64-bit unsigned integer. Analogous to the
// 32-bit version at
// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn.
int clz64(uint64_t x)
{
    assert(x);
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return clz64_tab[(uint64_t)(x * 0x03f6eaf2cd271461u) >> 58];
}

Doing exhaustive testing for a 64-bit input is no longer reasonable without a supercomputer to hand, but checking all inputs of the form s*s, s*s + s, and s*s + 2*s for 0 <= s < 2**32 is feasible, and gives some confidence that the code is working correctly. The following code performs that check. It takes around 148 seconds to run to completion on my Intel Core i7-based laptop, which works out at around 11.5ns per square root computation. That comes down to around 9.2ns per square root if I replace the custom clz64 implementation with __builtin_clzll. (Both of those timings still include the overhead of the testing code, of course.)

#include <stdio.h>

int check_isqrt64(uint64_t x) {
    uint64_t y = isqrt64(x);
    int y_ok = y*y <= x && x - y*y <= 2*y;
    if (!y_ok) {
        printf("isqrt64(%llu) returned incorrect answer %llu\n", x, y);
    }
    return y_ok;
}

int main(void)
{
    printf("Checking isqrt64 for selected values in [0, 2**64) ...\n");
    for (uint64_t s = 0; s < 0x100000000u; s++) {
        if (!check_isqrt64(s*s)) return 1;
        if (!check_isqrt64(s*s + s)) return 1;
        if (!check_isqrt64(s*s + 2*s)) return 1;
    };
    printf("All tests passed\n");
    return 0;
}

64-bit square root: branch-free, 4 divisions, no lookup table

The final variant gives a branch-free, lookup-table-free implementation of integer square root for a 64-bit input, just to demonstrate that it's possible. It needs 4 divisions. On most machines, those divisions will be slow enough that this variant is slower than the previous variant.

uint32_t isqrt64(uint64_t x)
{
    int lz = clz64(x | 1) & 62;
    x <<= lz;
    uint32_t y = 2 + (x >> 63);
    y = (y << 1) + (x >> 59) / y;
    y = (y << 3) + (x >> 53) / y;
    y = (y << 7) + (x >> 41) / y;
    y = (y << 15) + (x >> 17) / y;
    y -= x < (uint64_t)y * y;
    return y >> (lz >> 1);
}

64-bit square root of exact square; division-free!

Finally, as promised, here's an algorithm that's a bit different from those above. It computes the square root of an input that's already known to be square, using a Newton iteration for the inverse square root of the input, and operating in the 2-adic domain rather than the real domain. It requires no divisions, but it uses a lookup table, as well as relying on GCC's __builtin_ctzl intrinsic to count trailing zeros efficiently.

#include <stdint.h>

static const uint8_t lut[128] = {
    0, 85, 83, 102, 71, 2, 36, 126, 15, 37, 28, 22, 87, 50, 107, 46,
    31, 10, 115, 57, 103, 98, 4, 33, 47, 58, 3, 118, 119, 109, 116, 113,
    63, 106, 108, 38, 120, 61, 27, 62, 79, 101, 35, 41, 104, 13, 84, 17,
    95, 53, 76, 121, 88, 34, 59, 97, 111, 5, 67, 54, 72, 82, 52, 78,
    127, 42, 44, 25, 56, 125, 91, 1, 112, 90, 99, 105, 40, 77, 20, 81,
    96, 117, 12, 70, 24, 29, 123, 94, 80, 69, 124, 9, 8, 18, 11, 14,
    64, 21, 19, 89, 7, 66, 100, 65, 48, 26, 92, 86, 23, 114, 43, 110,
    32, 74, 51, 6, 39, 93, 68, 30, 16, 122, 60, 73, 55, 45, 75, 49,
};

uint32_t isqrt64_exact(uint64_t n)
{
    uint32_t m, k, x, b;

    if (n == 0)
        return 0;

    int j = __builtin_ctzl(n);
    n >>= j;
    m = (uint32_t)n;
    k = (uint32_t)(n >> 2);
    x = lut[k >> 1 & 127];
    x += (m * x * ~x - k) * (x - ~x);
    x += (m * x * ~x - k) * (x - ~x);
    b = m * x + 2 * k;
    b ^= -(b >> 31);
    return (b - ~b) << (j >> 1);
}

For purists, it's not hard to rewrite it to be completely branch free, to eliminate the lookup table, and to make the code completely portable and C99-compliant. Here's one way to do that:

#include <stdint.h>

uint32_t isqrt64_exact(uint64_t n)
{
    uint64_t t = (n & -n) - 1 & 0x5555555555555555U;
    t = t + (t >> 2) & 0x3333333333333333U;
    t = t + (t >> 4) & 0x0f0f0f0f0f0f0f0fU;
    int j = (uint64_t)(t * 0x0101010101010101U) >> 56;
    n >>= 2 * j;

    uint32_t m, k, x, b;
    m = (uint32_t)n;
    k = (uint32_t)(n >> 2);
    x = k * ~k;
    x += (m * x * ~x - k) * (x - ~x);
    x += (m * x * ~x - k) * (x - ~x);
    x += (m * x * ~x - k) * (x - ~x);
    b = m * x + 2 * k;
    b ^= -(b >> 31);
    return (2 * b | (m & 1)) << j;
}

For a complete explanation of how this code works, along with code to do exhaustive testing, see the GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40.

Mark Dickinson
  • 29,088
  • 9
  • 83
  • 120
3

No. You'd need to introduce a log somewhere; the fast floating point square root works because of the log in the bit representation.

The fastest method is probably a lookup table of n -> floor(sqrt(n)). You don't store all the values in the table, but only the values for which the square root changes. Use binary search to find the result in the table in log(n) time.

Joshua
  • 40,822
  • 8
  • 72
  • 132