0

I'm leaving the title more general, but I specifically want a fast 64-bit square root (sqrt) function for inputs between 0.5 and 1.0. (Actually, some "SSE2 reciprocal sqrt" would be ideal for my numerical simulation, but I assume this is too much to ask for. And, to be complete, a little division is also being used.)

The intrinsics I come across work with floating-point data which therefore waste 11 bits for the exponent. I know 53 bits is almost 64 bits, so yes, CPU makers can probably assume people like me will need to design some bigint algorithm in software anyway, so I'm guessing CPU makers simply put this low on their priority list.

Or, is there some bigger reason to avoid fixed-point intrinsics that I am missing? If I need slightly better than 53-bit accuracy (e.g., 60-bit accuracy), do I need to just accept a ~10x slow-down?

bobuhito
  • 285
  • 2
  • 20
  • Re "inputs between 0.5 and 1.0". So for purely fractional inputs in the interval [0.5, 1.0) using the scale factor 2**64? What are the accuracy requirements for this use case? – njuffa Jul 22 '21 at 21:37

1 Answers1

0

Title question: _mm_mulhrs_epi16 (pmulhrsw) is I think intended for 16-bit fixed-point multiply with averaging.

sqrt: x86 doesn't have any integer square root support, but the x87 fsqrt instruction does handle 80-bit long double with a 64-bit mantissa. (i.e. sqrtl() in C). But only one at a time (scalar), with even worse throughput than sqrtpd. That will also cost store/reload latency getting data in/out of x87 registers, even though fild / fistp can convert from/to int64_t with round-to-nearest, if you can get a C compiler to emit those.

e.g. on GNU/Linux (or other non-Windows platforms where long double is the 80-bit x87 type), I think this might be viable:

#include <stdint.h>
#include <math.h>

int64_t fixed_point_sqrt(int64_t a) {
    return lrintl(sqrtl(a) * (1LL<<32));  // rescale for your fixed point range
}

(lrintl does long double -> long conversion with the current default rounding mode, i.e. round to nearest. Otherwise just casting, you can get SSE3 fisttp truncation, or without SSE3 a slow change of the rounding mode to truncation and back.)

With GCC and clang (targeting Linux), you get (on Godbolt)

# gcc and clang -O3 -fno-math-errno   are both similar; this is clang:
fixed_point_sqrt(long):                  # @fixed_point_sqrt(long)
        mov     qword ptr [rsp - 16], rdi
        fild    qword ptr [rsp - 16]         # convert int64 -> 80-bit x87
        fsqrt
        fmul    dword ptr [rip + .LCPI1_0]   # float 4.2949673E+9 is exactly representable
        fistp   qword ptr [rsp - 8]          # convert back with roundinging
        mov     rax, qword ptr [rsp - 8]
        ret

Actually fisttp vs. fistp might not matter if the 80-bit FP value is always a whole integer anyway; not sure how the range works out there.


The normal use-case for fixed-point is for narrow elements where you don't want to waste space on an exponent field; your case of very limited dynamic range (largest value only twice the smallest) does make the exponent field redundant for FP values, but modern x86 CPUs spend a lot of transistors on SIMD-FP throughput so that's still a good bet for high performance.

If you care about maximum precision, note that the integer square root of a 64-bit integer only has 32 significant bits. But for fixed-point, the square roots of numbers between 0.5 and 1.0 are between .75 and 1.0, so you're only losing 1 bit of precision (MSB always set in the result). So the rescaling makes it different from pure integer sqrt.


If you need more mantissa bits, you can use double-double (https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) pairs of double, . Multiplying and adding such numbers is possible with SIMD, but IDK how efficiently one could implement sqrt.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • But, isn't [x87 dead](https://www.zdnet.com/article/nvidia-de-optimizes-physx-for-the-cpu)? I am using an AMD CPU built around 2013 and I do see sqrtl but it returns a 64-bit double...I guess that is why you mention fild/fistp, so maybe there's a chance I could get all 80 bits with assembly language. – bobuhito Jul 24 '21 at 02:52
  • I see that my assembly language is exactly the same using sqrtl and sqrt, so I'm going to conclude that I don't have a x87 processor. – bobuhito Jul 24 '21 at 03:26
  • @bobuhito: x87 isn't "dead", it's just obsolete for anything except needing somewhat more precision than `double`. I guess you're using Windows, where the standard ABI has `long double` = `double`. In that case, there isn't a C name for the full-precision x87 type, so you'd have to use it manually from asm. (And make sure MSVC's CRT startup code hasn't reduced the x87 precision. https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/, although I think that was only a thing for 32-bit mode.) – Peter Cordes Jul 24 '21 at 03:27
  • @bobuhito: There are options for making `long double` be 80-bit in C with GCC, I think, but you'd need a way to get an `fsqrt` instruction inlined, along with `fild` / `fistp` (or SSE3 `fisttp` to truncate like C float -> int semantic). So it might be easier to just use inline asm. – Peter Cordes Jul 24 '21 at 03:31
  • @bobuhito: On Linux it's easy, though: https://godbolt.org/z/he8bjcYee - `lrintl(sqrtl(a))` inlines the way you want, with `fild` / `fsqrt` / `fistp` (round-to-nearest). Of course that's integer square root, not fixed-point. You'd have to multiply by 2^32 I think to get back to the range you want before fp->int64_t conversion. Also of course this is int64_t, not uint64_t - unsigned is slower in hardware. – Peter Cordes Jul 24 '21 at 03:36
  • I am using Windows as you suspected, and I just checked the x87 precision with _controlfp_s and get 0x0008001f (so I believe I already have maximum precision). But, I test x87 with an asm function and always only see 53 bits of mantissa precision. This can be seen simply with "fldln2; fld approx; fsub" (where approx in data is "0.69314718055"). My result is 9.9452900342161809e-0012, indicating only 53 bits of precision. Am I doing something wrong? – bobuhito Jul 24 '21 at 17:10
  • @bobuhito: How did you return / print the result? If you returned it as `double` (or the equivalent `long double`), then it's not surprising it got rounded somewhere. One way to test would be to see if you can round-trip INT64_MAX through fild/fistp without rounding it up to out-of-range for int64_t, which would lead to an integer-indefinite result (only MSB set, i.e. INT64_MIN) – Peter Cordes Jul 24 '21 at 18:33
  • I viewed the result in many ways (I can even view ST0 directly in the Visual Studio debugger), but it really shouldn't matter because the 3 steps in my asm test all get done inside the FPU. Your suggestion to work with fild/fistp, however, will have 64 bits of precision, but that's because they are always integers. But, I ultimately need to take a sqrt and go floating-point. It seems I've done this correctly, and that my AMD processor might just be cutting my mantissa precision at 53 bits. – bobuhito Jul 24 '21 at 20:20
  • @bobuhito: I thought you were doing fixed-point integer so you'd need the result as an `int64_t`? If you return ST0 and describe it to the C compiler as a `long double`, then for ABIs that have double = long double, it will probably at some point get stored to memory as a qword not a tbyte (10-byte) by C compiler-generated code. It's not the hardware FPU that's rounding it, it's a store/reload somewhere (probably in the calling convention for passing long double to printf) before it gets converted to a string of decimal digits. – Peter Cordes Jul 24 '21 at 20:26
  • @bobuhito: So if you *do* want to return ST0 and let the C compiler work with that FP value, you're going to need to use a C compiler configured for `long double` = the x87 type, i.e. `sizeof(long double) > 8`. Like `gcc -mlong-double-80` (but you'd need a libc compiled that way to pass long double to printf). For clang, [Clang compiler (x86): 80 bit long double](https://stackoverflow.com/q/58544443). Or use the GNU C `__float80` / [How does GCC compile the 80 bit wide 10 byte float \_\_float80 on x86\_64?](https://stackoverflow.com/q/49909914) – Peter Cordes Jul 24 '21 at 20:29
  • Using int64_t or the 10B TBYTE data blocks I've been FSTP'ing to in asm would be fine...the compiler is not a problem for me since I am getting comfortable with asm. My point is that I still can't get fsqrt to give more than about 53 bits of mantissa precision. I know you are skeptical, but I doubt this is a read mistake on my part. – bobuhito Jul 25 '21 at 01:43
  • The most convincing thing is that I see all 80 bits in the TBYTE and the mantissa says fsqrt(0x0.7FFFFFFFFFFFFFFF) = 0x0.B504F333F9DE6800, but the true sqrt is 0x0.B504F333F9DE6484. I'd say the lowest 11 bits are being sacrificed by my FPU for speed since they aren't important for normal doubles anyway. – bobuhito Jul 25 '21 at 01:43
  • @bobuhito: Ok, you weren't being clear on how you were looking at the results, whether that was your program's eventual output or something you saw with the debugger while single-stepping `fsqrt` itself. Square Root is an IEEE "basic" operation, so FPUs are required to implement it exactly, with <= 0.5ulp of rounding error, i.e. the least-significant digit of the mantissa must be correctly rounded. So getting all zeros in the low few mantissa bits when that's not the exact result is a sign that software must have configured it for reduced precision, e.g. 64-bit instead of 80-bit. – Peter Cordes Jul 25 '21 at 01:57
  • @bobuhito: Are you sure `_controlfp_s` is giving you the x87 control word (http://www.ray.masmcode.com/tutorial/fpuchap1.htm#cword)? MS's docs say it's "portable" (https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/controlfp-s?view=msvc-160), and on an x86-64 platform it seems to be saying it gives you the SSE MXCSR. Your `0x0008001f` value for it has bits 9 and 10 cleared, so if that was really the x87 control word, that would mean 24-bit mantissa precision, but you're clearly getting more than that. You could just use `fninit` somewhere early in your program, or `_control87`. – Peter Cordes Jul 25 '21 at 02:01
  • I believe you're mistaking the bit spec. Cleared bits mean 64-bit mantissa (see the microsoft link you posted in your last comment), so I should already be in 64-bit mode. But, you might be onto something, maybe it's stuck somehow in 53-bit mode. ? – bobuhito Jul 25 '21 at 03:23
  • But, thanks to your last comment, I just tried inserting an fninit as the first line of the asm code, and all 64 bits now are correct! By the way, checking before and after the fninit, _controlfp_s still shows 0x0008001f. Hmmm, I wish 90% of my life weren't spent stuck in ruts like this :( Thanks for all your help! – bobuhito Jul 25 '21 at 03:53
  • @bobuhito: `_controlfp_s` doesn't access the **x87** control word at all in x64 builds: *On x64 architectures, only the SSE2 control word that's stored in the MXCSR register is affected.* You're right that `_controlfp_s` is an abstraction layer with the precision-control field in a different position from [the actual x87 control word](http://www.ray.masmcode.com/tutorial/fpuchap1.htm#cword), and 00 does mean 64-bit (again unlike x87), but that field only applies on 32-bit x86 as noted in the table. So of course `fninit` doesn't have an effect on `controlfp` results. – Peter Cordes Jul 25 '21 at 05:32
  • @bobuhito: You know that MXCSR is the rounding / exception mask and state register for SSE math, right? (https://softpixel.com/~cwright/programming/simd/sse.php). Totally separate at a hardware / ISA level from the x87 state. It's only MSVC's weird `_controlfp_s` function that even tries to unify them behind one API. (And apparently only for 32-bit, because x64 MSVC with its `double = long double` ABI won't ever generate x87 instructions.) – Peter Cordes Jul 25 '21 at 05:34