Below are some variants of the code given by @njuffa, including some that are not just loop-free but also branch-free (the ones that aren't branch-free are almost branch-free). All of the variants below are derived from the algorithm that's used internally for Python's math.isqrt. The variants are:
- 32-bit square root: almost branch-free, 1 division
- 32-bit square root: branch-free, 1 division
- 32-bit square root: branch-free, 3 divisions, no lookup table
- 64-bit square root: almost branch-free, 2 divisions
- 64-bit square root: branch-free, 4 divisions, no lookup table
To finish, we give a very fast division-free (and still loop-free, almost branch-free) square root algorithm for unsigned 64-bit inputs. The catch is that it only works for inputs already known to be square; for non-square inputs it will give results that are not useful. It manages to compute each square root in around 3.3 nanoseconds on my 2018 2.7 GHz Intel Core i7 laptop. Scroll all the way down to the bottom of the answer to see it.
32-bit square root: almost branch-free, 1 division
The first variant is the one I'd recommend, all other things being equal. It uses:
- a single lookup into a 192-byte lookup table
- a single 32-bit-by-16-bit division with 16-bit result
- a single 16-bit-by-16-bit multiplication with 32-bit result
- a count-leading-zeros operation, and a handful of shifts and other cheap operations
After the early return for the case x == 0, it's essentially branch-free. (There's a hint of a branch in the final correction step, but the compilers I tried manage to give jump-free code for this.)
Here's the code. Explanations follow.
#include <stdint.h>
// count leading zeros of nonzero 32-bit unsigned integer
int clz32(uint32_t x);
// isqrt32_tab[k] = isqrt(256*(k+64)-1) for 0 <= k < 192
static const uint8_t isqrt32_tab[192] = {
127, 128, 129, 130, 131, 132, 133, 134, 135, 136,
137, 138, 139, 140, 141, 142, 143, 143, 144, 145,
146, 147, 148, 149, 150, 150, 151, 152, 153, 154,
155, 155, 156, 157, 158, 159, 159, 160, 161, 162,
163, 163, 164, 165, 166, 167, 167, 168, 169, 170,
170, 171, 172, 173, 173, 174, 175, 175, 176, 177,
178, 178, 179, 180, 181, 181, 182, 183, 183, 184,
185, 185, 186, 187, 187, 188, 189, 189, 190, 191,
191, 192, 193, 193, 194, 195, 195, 196, 197, 197,
198, 199, 199, 200, 201, 201, 202, 203, 203, 204,
204, 205, 206, 206, 207, 207, 208, 209, 209, 210,
211, 211, 212, 212, 213, 214, 214, 215, 215, 216,
217, 217, 218, 218, 219, 219, 220, 221, 221, 222,
222, 223, 223, 224, 225, 225, 226, 226, 227, 227,
228, 229, 229, 230, 230, 231, 231, 232, 232, 233,
234, 234, 235, 235, 236, 236, 237, 237, 238, 238,
239, 239, 240, 241, 241, 242, 242, 243, 243, 244,
244, 245, 245, 246, 246, 247, 247, 248, 248, 249,
249, 250, 250, 251, 251, 252, 252, 253, 253, 254,
254, 255,
};
// integer square root of 32-bit unsigned integer
uint16_t isqrt32(uint32_t x)
{
if (x == 0) return 0;
int lz = clz32(x) & 30;
x <<= lz;
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Above, we omitted the definition of the function clz32, which can be implemented in exactly the same way as clz in @njuffa's post. With gcc or Clang, you can use something like __builtin_clz in place of clz32. If you have to roll your own, steal it from @njuffa's answer.
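If you just want something quick on gcc or Clang, a thin wrapper over the builtin should do (a sketch, assuming unsigned int is 32 bits wide):

#include <stdint.h>

// count leading zeros of a nonzero 32-bit unsigned integer; undefined for
// x == 0, matching the contract assumed by isqrt32 above
int clz32(uint32_t x)
{
    return __builtin_clz(x);
}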
Explanation: after handling the special case x = 0, we start by normalising x, shifting left by an even number of bits (effectively multiplying by a power of 4) so that 2**30 <= x < 2**32. We then compute the integer square root y of x, and shift y back to compensate just before returning.
This brings us to the central three lines of code, which are the heart of the algorithm:
uint16_t y = 1 + isqrt32_tab[(x >> 24) - 64];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
Remarkably, these three lines are all that's needed to compute the integer square root y of a 32-bit integer x under the assumption that 2**30 <= x < 2**32.

The first of the three lines uses the lookup table to retrieve an approximation to the square root of the topmost 16 bits of x, namely x >> 16. The approximation will always have an error smaller than 1: that is, after the first line is executed, the condition

(y-1) * (y-1) < (x >> 16) && (x >> 16) < (y+1) * (y+1)

will always be true.
The second line is the most interesting one: before that line is executed, y is an approximation to the square root of x >> 16, with around 7-8 bits of accuracy. It follows that y << 8 is an approximation to the square root of x, again with around 7-8 accurate bits. So a single Newton iteration applied to y << 8 should give a better approximation to the square root of x, roughly doubling the number of accurate bits, and that's exactly what the second line computes. The really beautiful part is that if you work through the mathematics, it turns out that you can prove that the resulting y again has error smaller than 1: that is, the condition

(y-1) * (y-1) < x && x < (y+1) * (y+1)

will be true. That means that either y*y <= x and y is already the integer square root of x, or x < y*y and y - 1 is the integer square root of x. So the third line applies that -1 correction to y in the event that x < y*y.
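To make this concrete, here's a quick worked example tracing the whole of isqrt32; the arithmetic is easy to check by hand. Take x = 1000000, whose integer square root is 1000.

// x = 1000000, so clz32(x) = 12, lz = 12, and x becomes 1000000 << 12 = 4096000000
// x >> 24 = 244, and 1 + isqrt32_tab[244 - 64] = 1 + 249 = 250, so y = 250
//   (250 is exactly the square root of x >> 16 = 62500)
// y = (250 << 7) + (4096000000 >> 9) / 250 = 32000 + 8000000 / 250 = 64000
// 64000 * 64000 = 4096000000 = x, so the correction in the third line does nothing
// finally, y >> (lz >> 1) = 64000 >> 6 = 1000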
There's one slightly subtle point above. If you follow through the possible sizes of y as the algorithm progresses: after the lookup we have 128 <= y <= 256, and after the second line, mathematically, we have 32768 <= y <= 65536. But if y = 65536 then we have a problem, since y can no longer be stored in a uint16_t. However, it's possible to prove that that can't happen: roughly, the only way for y to end up being that large is if the result of the lookup is 256, and in that case it's straightforward to see that the next line can only produce a maximum of 65535.
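Unlike the 64-bit case further down, exhaustively testing all 2**32 inputs is entirely feasible here. A simple harness along the following lines should do (a sketch; check_isqrt32 just mirrors the check_isqrt64 helper used later in this answer):

#include <stdint.h>
#include <stdio.h>

// y is the integer square root of x iff y*y <= x < (y+1)*(y+1), or
// equivalently y*y <= x && x - y*y <= 2*y; done in 64 bits to avoid overflow
int check_isqrt32(uint32_t x)
{
    uint64_t y = isqrt32(x);
    int y_ok = y * y <= x && x - y * y <= 2 * y;
    if (!y_ok) {
        printf("isqrt32(%u) returned incorrect answer %u\n", (unsigned)x, (unsigned)y);
    }
    return y_ok;
}

int main(void)
{
    printf("Checking isqrt32 for all values in [0, 2**32) ...\n");
    uint32_t x = 0;
    do {
        if (!check_isqrt32(x)) return 1;
    } while (++x != 0);
    printf("All tests passed\n");
    return 0;
}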
32-bit square root: branch-free, 1 division
The algorithm above is tantalisingly close to being completely branch-free. Here's a variant that's genuinely branch-free, at least with the compilers I tried. It uses the same clz32 and lookup table as the previous example.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
int index = (x >> 24) - 64;
uint16_t y = 1 + isqrt32_tab[index >= 0 ? index : 0];
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
Here we're simply letting the special case x == 0 propagate through the algorithm as usual. We compute clz32(x | 1) in place of clz32(x), partly because clz32(0) may not be well-defined (it isn't for gcc's __builtin_clz, for example), and partly because even if it were well-defined, the resulting shift of 32 would give us undefined behaviour in x <<= lz on the following line. We also have to adjust the lookup to allow for a possibly negative lookup index. (We could instead extend the lookup table from 192 entries to 256 entries just to accommodate this case, but that seems wasteful.) Most platforms should have compilers clever enough to turn index >= 0 ? index : 0 into something branch-free; some architectures even provide a saturating subtraction instruction that does the job directly.
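As a quick sanity check that the zero case really does propagate correctly:

// x == 0: lz = clz32(0 | 1) & 30 = 31 & 30 = 30, and 0 << 30 is still 0
// index = (0 >> 24) - 64 = -64, clamped to 0, so y = 1 + isqrt32_tab[0] = 128
// y = (128 << 7) + 0 / 128 = 16384
// the correction fires (0 < 16384 * 16384), leaving y = 16383
// return value: 16383 >> (30 >> 1) = 16383 >> 15 = 0, as required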
32-bit square root: branch-free, 3 divisions, no lookup table
The next variant needs no lookup table, but uses three divisions instead of one.
uint16_t isqrt32(uint32_t x)
{
int lz = clz32(x | 1) & 30;
x <<= lz;
uint32_t y = 1 + (x >> 30);
y = (y << 1) + (x >> 27) / y;
y = (y << 3) + (x >> 21) / y;
y = (y << 7) + (x >> 9) / y;
y -= x < (uint32_t)y * y;
return y >> (lz >> 1);
}
The idea is exactly the same as before, except that we start from smaller approximations; a concrete trace is given after the list.
- After the initial line uint32_t y = 1 + (x >> 30); executes, y is an approximation to the square root of x >> 28.
- After y = (y << 1) + (x >> 27) / y; executes, y is an approximation to the square root of the eight most significant bits of x, x >> 24.
- After y = (y << 3) + (x >> 21) / y; executes, y is an approximation to the square root of the sixteen most significant bits of x, x >> 16.
- The rest of the algorithm proceeds as before.
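Here's the promised trace, reusing x = 1000000 from earlier (normalised to 4096000000 by the shift of 12):

// y = 1 + (x >> 30)            = 1 + 3                 = 4     (cf. sqrt(x >> 28) = sqrt(15)    ~ 3.9)
// y = (y << 1) + (x >> 27) / y = 8 + 30/4 = 8 + 7      = 15    (cf. sqrt(x >> 24) = sqrt(244)   ~ 15.6)
// y = (y << 3) + (x >> 21) / y = 120 + 1953/15 = 250           (cf. sqrt(x >> 16) = sqrt(62500) = 250)
// y = (y << 7) + (x >> 9) / y  = 32000 + 8000000/250   = 64000
// no correction is needed, and 64000 >> 6 = 1000 as before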
64-bit square root: almost branch-free, 2 divisions
The OP asked for an unsigned 32-bit square root, but the technique above works just as well for computing the integer square root of an unsigned 64-bit integer.
Here's some code for that case, which is essentially the same code that Python's math.isqrt uses for inputs smaller than 2**64 (and also to provide the starting guesses for larger inputs).
#include <stdint.h>
// count leading zeros of nonzero 64-bit unsigned integer
int clz64(uint64_t x);
// isqrt64_tab[k] = isqrt(256*(k+65)-1) for 0 <= k < 192
static const uint8_t isqrt64_tab[192] = {
128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
138, 139, 140, 141, 142, 143, 143, 144, 145, 146,
147, 148, 149, 150, 150, 151, 152, 153, 154, 155,
155, 156, 157, 158, 159, 159, 160, 161, 162, 163,
163, 164, 165, 166, 167, 167, 168, 169, 170, 170,
171, 172, 173, 173, 174, 175, 175, 176, 177, 178,
178, 179, 180, 181, 181, 182, 183, 183, 184, 185,
185, 186, 187, 187, 188, 189, 189, 190, 191, 191,
192, 193, 193, 194, 195, 195, 196, 197, 197, 198,
199, 199, 200, 201, 201, 202, 203, 203, 204, 204,
205, 206, 206, 207, 207, 208, 209, 209, 210, 211,
211, 212, 212, 213, 214, 214, 215, 215, 216, 217,
217, 218, 218, 219, 219, 220, 221, 221, 222, 222,
223, 223, 224, 225, 225, 226, 226, 227, 227, 228,
229, 229, 230, 230, 231, 231, 232, 232, 233, 234,
234, 235, 235, 236, 236, 237, 237, 238, 238, 239,
239, 240, 241, 241, 242, 242, 243, 243, 244, 244,
245, 245, 246, 246, 247, 247, 248, 248, 249, 249,
250, 250, 251, 251, 252, 252, 253, 253, 254, 254,
255, 255,
};
// integer square root of a 64-bit unsigned integer
uint32_t isqrt64(uint64_t x)
{
if (x == 0) return 0;
int lz = clz64(x) & 62;
x <<= lz;
uint32_t y = isqrt64_tab[(x >> 56) - 64];
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
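The structure mirrors the 32-bit version, with one extra Newton step because twice as many result bits are needed. Informally, the invariants are:

// after the table lookup:               y approximates sqrt(x >> 48)  (the top 16 bits of x)
// after y = (y << 7) + (x >> 41) / y:   y approximates sqrt(x >> 32)
// after y = (y << 15) + (x >> 17) / y:  y approximates sqrt(x)
// as in the 32-bit case, each approximation has error smaller than 1, so the
// single conditional correction at the end gives the exact integer square root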
The above code assumes the existence of a function clz64 for counting leading zeros in a uint64_t value. The function clz64 is permitted to have undefined behaviour in the case x == 0. Again, on gcc and Clang, __builtin_clzll should be usable in place of clz64, assuming that unsigned long long has a width of 64. For completeness, here's an implementation of clz64, based on the usual de Bruijn sequence trickery.
#include <assert.h>
static const uint8_t clz64_tab[64] = {
63, 5, 62, 4, 16, 10, 61, 3, 24, 15, 36, 9, 30, 21, 60, 2,
12, 26, 23, 14, 45, 35, 43, 8, 33, 29, 52, 20, 49, 41, 59, 1,
6, 17, 11, 25, 37, 31, 22, 13, 27, 46, 44, 34, 53, 50, 42, 7,
18, 38, 32, 28, 47, 54, 51, 19, 39, 48, 55, 40, 56, 57, 58, 0,
};
// count leading zeros of nonzero 64-bit unsigned integer. Analogous to the
// 32-bit version at
// https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn.
int clz64(uint64_t x)
{
assert(x);
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x |= x >> 32;
return clz64_tab[(uint64_t)(x * 0x03f6eaf2cd271461u) >> 58];
}
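Alternatively, if you're happy to rely on the compiler, a wrapper over the builtin is simpler (a sketch, assuming gcc or Clang and a 64-bit unsigned long long):

int clz64(uint64_t x)
{
    return __builtin_clzll(x);  // undefined for x == 0, as permitted above
}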
Doing exhaustive testing for a 64-bit input is no longer reasonable without a supercomputer to hand, but checking all inputs of the form s*s, s*s + s, and s*s + 2*s for 0 <= s < 2**32 is feasible, and gives some confidence that the code is working correctly. The following code performs that check. It takes around 148 seconds to run to completion on my Intel Core i7-based laptop, which works out at around 11.5 ns per square root computation. That comes down to around 9.2 ns per square root if I replace the custom clz64 implementation with __builtin_clzll. (Both of those timings still include the overhead of the testing code, of course.)
#include <stdio.h>
int check_isqrt64(uint64_t x) {
uint64_t y = isqrt64(x);
int y_ok = y*y <= x && x - y*y <= 2*y;
if (!y_ok) {
printf("isqrt64(%llu) returned incorrect answer %llu\n", x, y);
}
return y_ok;
}
int main(void)
{
printf("Checking isqrt64 for selected values in [0, 2**64) ...\n");
for (uint64_t s = 0; s < 0x100000000u; s++) {
if (!check_isqrt64(s*s)) return 1;
if (!check_isqrt64(s*s + s)) return 1;
if (!check_isqrt64(s*s + 2*s)) return 1;
}
printf("All tests passed\n");
return 0;
}
64-bit square root: branch-free, 4 divisions, no lookup table
The final variant gives a branch-free, lookup-table-free implementation of integer square root for a 64-bit input, just to demonstrate that it's possible. It needs 4 divisions. On most machines, those divisions will be slow enough that this variant is slower than the previous variant.
uint32_t isqrt64(uint64_t x)
{
int lz = clz64(x | 1) & 62;
x <<= lz;
uint32_t y = 2 + (x >> 63);
y = (y << 1) + (x >> 59) / y;
y = (y << 3) + (x >> 53) / y;
y = (y << 7) + (x >> 41) / y;
y = (y << 15) + (x >> 17) / y;
y -= x < (uint64_t)y * y;
return y >> (lz >> 1);
}
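As in the 32-bit table-free version, the early lines build up the approximation from the top bits of x, with each division roughly doubling the number of accurate bits:

// y = 2 + (x >> 63)             : y approximates sqrt(x >> 60)   (~2 bits)
// y = (y << 1) + (x >> 59) / y  : y approximates sqrt(x >> 56)   (~4 bits)
// y = (y << 3) + (x >> 53) / y  : y approximates sqrt(x >> 48)   (~8 bits)
// y = (y << 7) + (x >> 41) / y  : y approximates sqrt(x >> 32)   (~16 bits)
// y = (y << 15) + (x >> 17) / y : y approximates sqrt(x)         (~32 bits)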
64-bit square root of exact square; division-free!
Finally, as promised, here's an algorithm that's a bit different from those above. It computes the square root of an input that's already known to be square, using a Newton iteration for the inverse square root of the input, and operating in the 2-adic domain rather than the real domain. It requires no divisions, but it does use a lookup table, as well as relying on GCC's __builtin_ctzl intrinsic to count trailing zeros efficiently.
#include <stdint.h>
static const uint8_t lut[128] = {
0, 85, 83, 102, 71, 2, 36, 126, 15, 37, 28, 22, 87, 50, 107, 46,
31, 10, 115, 57, 103, 98, 4, 33, 47, 58, 3, 118, 119, 109, 116, 113,
63, 106, 108, 38, 120, 61, 27, 62, 79, 101, 35, 41, 104, 13, 84, 17,
95, 53, 76, 121, 88, 34, 59, 97, 111, 5, 67, 54, 72, 82, 52, 78,
127, 42, 44, 25, 56, 125, 91, 1, 112, 90, 99, 105, 40, 77, 20, 81,
96, 117, 12, 70, 24, 29, 123, 94, 80, 69, 124, 9, 8, 18, 11, 14,
64, 21, 19, 89, 7, 66, 100, 65, 48, 26, 92, 86, 23, 114, 43, 110,
32, 74, 51, 6, 39, 93, 68, 30, 16, 122, 60, 73, 55, 45, 75, 49,
};
uint32_t isqrt64_exact(uint64_t n)
{
uint32_t m, k, x, b;
if (n == 0)
return 0;
int j = __builtin_ctzl(n);           // count of trailing zero bits; even, since n is a perfect square
n >>= j;                             // reduce to the odd part of n
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = lut[k >> 1 & 127];               // initial approximation to the 2-adic inverse square root
x += (m * x * ~x - k) * (x - ~x);    // first Newton-style refinement step (mod 2**32)
x += (m * x * ~x - k) * (x - ~x);    // second refinement step
b = m * x + 2 * k;
b ^= -(b >> 31);
return (b - ~b) << (j >> 1);         // b - ~b == 2*b + 1 is the square root of the odd part; restore the shift
}
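Here's a quick usage sketch; keep in mind that the result is only meaningful when the input really is a perfect square.

#include <stdint.h>
#include <stdio.h>

uint32_t isqrt64_exact(uint64_t n);  // as defined above

int main(void)
{
    uint64_t s = 4294967295u;  // 2**32 - 1, the largest possible square root
    printf("%llu\n", (unsigned long long)isqrt64_exact(49));     // expected: 7
    printf("%llu\n", (unsigned long long)isqrt64_exact(s * s));  // expected: 4294967295
    printf("%llu\n", (unsigned long long)isqrt64_exact(0));      // expected: 0
    return 0;
}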
For purists, it's not hard to rewrite it to be completely branch free, to eliminate the lookup table, and to make the code completely portable and C99-compliant. Here's one way to do that:
#include <stdint.h>
uint32_t isqrt64_exact(uint64_t n)
{
uint64_t t = (n & -n) - 1 & 0x5555555555555555U;
t = t + (t >> 2) & 0x3333333333333333U;
t = t + (t >> 4) & 0x0f0f0f0f0f0f0f0fU;
int j = (uint64_t)(t * 0x0101010101010101U) >> 56;
n >>= 2 * j;
uint32_t m, k, x, b;
m = (uint32_t)n;
k = (uint32_t)(n >> 2);
x = k * ~k;
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
x += (m * x * ~x - k) * (x - ~x);
b = m * x + 2 * k;
b ^= -(b >> 31);
return (2 * b | (m & 1)) << j;
}
For a complete explanation of how this code works, along with code to do exhaustive testing, see the GitHub gist at https://gist.github.com/mdickinson/e087001d213725a93eeb8d8f447a2f40.