4

I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?

I think I need this division because, since I approximate atan2(y, x) with a 1D lookup table, I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond', i.e. a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is 32-bit while the numerator is a 32-bit number shifted left by 29 bits, so that the result of the division has 29 fractional bits. Also, using floating point division is not an option, as this function is required not to use floating point arithmetic.
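For reference, the contested operation in isolation looks like this (a stripped-down sketch with a hypothetical helper name, not the actual function; the sum is widened to 64 bits here just to keep the sketch overflow-safe for extreme inputs):

```c
#include <stdint.h>

/* Hypothetical sketch of the normalisation described above:
   returns (|y| - |x|) / (|y| + |x|) with 29 fractional bits (s1.29).
   The 64-bit division on the last line is the expensive operation. */
static int64_t diamond_norm(int32_t y, int32_t x)   /* x and y must not both be 0 */
{
    int32_t xa = x < 0 ? -x : x;
    int32_t ya = y < 0 ? -y : y;
    return ((int64_t)(ya - xa) << 29) / ((int64_t)ya + xa);
}
```

On the unit diamond this ranges from -1.0 (point on the x axis) to +1.0 (point on the y axis).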

Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:

int32_t fpatan2(int32_t y, int32_t x)       // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
    #include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
    const uint32_t outfmt = 32; // final output format in s0.outfmt
    const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp);   // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
    const uint32_t halfof = 1UL<<(outfmt-1);    // represents 0.5 in the output format, which since it is in turns means half a circle
    const uint32_t pds=ds-lutsp;    // division shift and post-division shift
    uint32_t lutind, p, t, d;
    int32_t a, b, xa, ya, xs, ys, div, r;

    xs = x >> 31;       // equivalent of fabs()
    xa = (x^xs) - xs;
    ys = y >> 31;
    ya = (y^ys) - ys;

    d = ya+xa;
    if (d==0)       // if both y and x are 0 then they add up to 0 and we must return 0
        return 0;

    // the following does 0.5 * (1. - (y-x) / (y+x))
    // (y+x) is u1.31, (y-x) is s0.31, div is in s1.29

    div = ((int64_t) (ya-xa)<<ds) / d;  // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
    p = ((1UL<<ds) - div) >> 1;     // before shift the format is s2.29. position in u1.29

    lutind = p >> ish;      // index for the LUT
    t = (p & tmask) >> tbd;     // interpolator between two LUT entries

    a = fpatan_lut[lutind];
    b = fpatan_lut[lutind+1];
    r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip);  // linear interpolation of a and b by t in s0.32 format

    // Quadrants
    if (xs)             // if x was negative
        r = halfof - r;     // r = 0.5 - r

    r = (r^ys) - ys;        // if y was negative then r is negated

    return r;
}
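The branchless sign handling used above (`x >> 31` then `(x^xs) - xs`) can be checked in isolation; note that it relies on `>>` of a negative int being an arithmetic shift, which is implementation-defined in C but is what gcc does:

```c
#include <stdint.h>

/* Branchless |v|: s is 0 for non-negative v and -1 (all ones) for negative v;
   (v ^ s) - s is then the identity or a two's-complement negation respectively. */
static int32_t branchless_abs(int32_t v)
{
    int32_t s = v >> 31;
    return (v ^ s) - s;
}
```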
Michel Rouzic
  • How did you determine the time taken by the division? That seems unexpectedly high. – wallyk Aug 30 '14 at 17:36
  • For `div = ((int64_t) (ya-xa)< – wallyk Aug 30 '14 at 17:39
  • I profiled it as is and then with replacing the '/' with a '*', that gave me a 69 cycles difference every time. And yes it's necessary, ds is always 29, and x and y are usually large integer numbers (because they're usually fixed point). – Michel Rouzic Aug 30 '14 at 17:48
  • Hint: replace uint32_t by uint_fast32_t and investigate in the command line options of your compiler the best way to obtain "truly" fast computations with "fast" integer types. – pablo1977 Aug 30 '14 at 18:01
  • @OliCharlesworth: Sure, there are only 25% of the bits to process. – wallyk Aug 30 '14 at 18:03
  • The use of mingw-w64 suggests a 64bit target with an FPU; is there really any advantage in using fixed point on such a platform? It is likely to be slower. – Clifford Aug 30 '14 at 18:06
  • @pablo1977 the uint_fast32_t stuff didn't change anything, I use -O3 and -ffast-math, not sure what else to use. – Michel Rouzic Aug 30 '14 at 18:08
  • @wally on a 64 bit processor the data paths are 64bits wide and division is a single instruction, I don't see why it would be faster. – Clifford Aug 30 '14 at 18:08
  • @Clifford: On an 8086, doing 8 bit divides is much faster than 16 bit, even if the data bus is 16 bits. Up to 184 cycles for 16 bit, and up to 112 cycles for 8 bit. It's not 4:1, but I recall that many cpus improved this quite a bit since 1982. – wallyk Aug 30 '14 at 18:14
  • @Clifford it's not purely about the performance of this function, it's more like a whole project in fixed point, and not only for performance. Also if I change the cast to int32_t then that division (now in 32-bits) only takes about 32 cycles as opposed to 70 (but it's unsuitable), so yeah it makes it definitely faster. – Michel Rouzic Aug 30 '14 at 18:14
  • @Clifford division is a serial process. On the processor OP is using, a 64bit div takes up to 75 cycles, a 32bit div takes up to 43 cycles (both can take as few as 16 cycles, depending, IIRC, on the magnitude of the result). On Haswell the difference is even bigger, up to 96 cycles for 64bit, up to 29 cycles for 32bit. In all of those cases, a whole lot of µops are generated, so in a sense division is not even really a single instruction. – harold Aug 30 '14 at 18:23
  • @harold interesting, where do you get those numbers from? I could use that! – Michel Rouzic Aug 30 '14 at 18:25
  • @MichelRouzic Agner Fog's tables. http://instlatx64.atw.hu/ is more complete. – harold Aug 30 '14 at 18:26
  • Have you considered implementing division as reciprocation+multiply? You have the LZCNT instruction to normalize the divisor to the range [1, 2); You can then use a polynomial approximation to produce a reciprocal estimate, then you can use a few quadratic- or cubic-convergence Newton-Raphson iterations to increase your precision to full 64-bits. This might actually be faster depending on how well you can implement it. – Iwillnotexist Idonotexist Aug 30 '14 at 18:43
  • @IwillnotexistIdonotexist that's a good idea! I didn't know the LZCNT instruction existed (though I use a LUT-based function that does that in ~3 cycles), and I know nothing about how to find that reciprocal, I'll have to look into that, thank you :). – Michel Rouzic Aug 30 '14 at 18:59
  • @MichelRouzic I recently implemented integer division on a VLIW vector processor using only 8x8-bit multiplies. I'm not exactly at liberty to share the code, but for the initial estimate I used http://hal.archives-ouvertes.fr/docs/00/15/33/70/PDF/newcas07.pdf and the Newton-Raphson setups for quadratic and cubic convergence on the reciprocal given an estimate are in http://kestrel.soe.ucsc.edu/papers/new2.pdf. The initial approximation given in the Kestrel paper is cute (5 bits accurate), but I've found more value in using the 13-bit-accurate initial estimate above, avoiding two extra NR steps. – Iwillnotexist Idonotexist Aug 30 '14 at 19:30
  • Thanks, I'll look into it. But I was thinking, since I would do the LZCNT-based shift like I did in my integer sqrt(), I could find the reciprocal of the result of the shift using an interpolated LUT for that [1, 2) range like I did with my integer sqrt which runs in about 15 cycles with good accuracy! – Michel Rouzic Aug 30 '14 at 19:42
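A sketch of what the reciprocal-plus-Newton-Raphson route from these comments could look like (hypothetical helper, assuming the divisor has already been normalised into [2^30, 2^31) by an LZCNT-style shift; the seed is the classic 48/17 - 32/17*x linear estimate, and gcc's `unsigned __int128` stands in for the wide multiply):

```c
#include <stdint.h>

/* Sketch of reciprocal by Newton-Raphson (per the comments above).
   Input xn must be pre-normalised to [2^30, 2^31), i.e. X = xn/2^31 in [0.5, 1).
   Returns r ~= 2^60 / xn, accurate to about 1 ulp after three steps. */
static uint64_t recip_q29(uint32_t xn)
{
    /* linear seed 1/X ~= 48/17 - 32/17*X, max relative error 1/17 (~4 good bits) */
    uint64_t r = 1515870810u - (8u * (uint64_t)xn) / 17u;  /* (48/17)*2^29 - (8/17)*xn */
    for (int i = 0; i < 3; i++) {   /* each step roughly doubles the correct bits */
        uint64_t t = ((uint64_t)1 << 61) - (uint64_t)xn * r;  /* (2 - X*R), scaled by 2^60 */
        r = (uint64_t)(((unsigned __int128)r * t) >> 60);     /* R*(2 - X*R) */
    }
    return r;
}
```

Dividing then amounts to multiplying the dividend by r and shifting right by 60 minus the normalisation shift.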

4 Answers

6

Unfortunately a 70 cycle latency is typical for a 64-bit integer division on x86 CPUs. Floating point division typically has about half that latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating point execution units (they're very expensive in terms of silicon area), so the integers need to be converted to floating point and back again. Just substituting a floating-point division in place of the integer one therefore isn't likely to help; you'd need to refactor your code to use floating point throughout to take advantage of the faster floating-point division.

If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
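For illustration (not usable under the question's no-floating-point constraint), `_mm_rcp_ss` exposes RCPSS, whose roughly 12-bit estimate a single Newton-Raphson step refines to near full single precision:

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ss */
#include <math.h>

/* ~1/x via the RCPSS approximation followed by one Newton-Raphson step
   r = r*(2 - x*r), which roughly squares the relative error. */
static float fast_recip(float x)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
    return r * (2.0f - x * r);
}
```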

Ross Ridge
  • Thanks! Just for the record the whole function HAS to be fixed point with no floating point, regardless of performance. – Michel Rouzic Aug 30 '14 at 18:41
3

Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, a reciprocal and a multiplication, I implemented a division function that runs in about 23.3 cycles with a pretty great precision of 1 part in 19 million, using a 1.5 kB LUT. For example, in one of the worst cases, 1428769848 / 1080138864, you might get 1.3227648959 instead of 1.3227649663.

I figured out an interesting technique while researching this. I was really struggling to think of something that could be fast and precise enough; not even a quadratic approximation of 1/x on [0.5, 1.0) combined with an interpolated difference LUT would do. Then I had the idea of doing it the other way around: I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment representing 1/128th of the [0.5, 1.0) curve, which gives a very small error. Using the 7 most significant bits of what represents x in the [0.5, 1.0) range as a LUT index, I directly get the coefficients that work best for the segment that x falls into.

Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:

#include "ffo_lut.h"

static INLINE int32_t log2_ffo32(uint32_t x)    // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
    int32_t y;

    y = x>>21;  if (y)  return ffo_lut[y]+21;
    y = x>>10;  if (y)  return ffo_lut[y]+10;
    return ffo_lut[x];
}
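As an aside, on gcc the same leading-bit position can be had without a LUT via `__builtin_clz`, which compiles to BSR/LZCNT on x86 (a hypothetical drop-in, assuming `ffo_lut[0]` is 0):

```c
#include <stdint.h>

/* Same contract as log2_ffo32 above (2^return > x >= 2^(return-1)), using
   gcc's __builtin_clz instead of a LUT. __builtin_clz(0) is undefined, so
   the zero case is guarded explicitly. */
static int32_t log2_ffo32_clz(uint32_t x)
{
    return x ? 32 - __builtin_clz(x) : 0;
}
```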

// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0; division by 0 causes a crash via a negative-index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt)   // ~23.3 cycles, max error (by division) 53.39e-9
{
    #include "fpdiv.h"  // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
    const int32_t *c;
    int32_t xa, xs, p, sh;
    uint32_t expon, frx, lutind;
    const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1);  // the shift for the index, the shift for 31-bit xa, the value of 0.5
    int64_t out;
    int64_t c0, c1, c2;

    // turn x into xa (|x|) and sign of x (xs)
    xs = x >> 31;
    xa = (x^xs) - xs;

    // decompose |x| into frx * 2^expon
    expon = log2_ffo32(xa);
    frx = (xa << (31-expon)) >> cfs;    // the fractional part is now in 0.27 format

    // lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
    lutind = (frx - half) >> ish;       // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
    lutind *= 3;                // 3 entries for each index
    c = &fpdiv_lut[lutind];         // c points to the correct c0, c1, c2
    c0 = c[0];    c1 = c[1];    c2 = c[2];
    p = (int64_t) frx * frx >> prec;    // x^2
    p = c2 * p >> prec;         // c2 * x^2
    p += c1 * frx >> prec;          // + c1 * x
    p += c0;                // + c0, p = (1.0 , 2.0] in 2.27 format

    // apply the necessary bit shifts and reapply the original sign of x to make the final result
    sh = expon + prec - outfmt;     // calculates the final needed shift
    out = (int64_t) y * p;          // format is s31 + 1.27 = s32.27
    if (sh >= 0)
        out >>= sh;
    else
        out <<= -sh;
    out = (out^xs) - xs;            // if x was negative then out is negated

    return out;
}

I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.

As for the fpatan2() question the solution would be to replace this line:

div = ((int64_t) (ya-xa)<<ds) / d;

with that line:

div = fpdiv(ya-xa, d, ds);
Michel Rouzic
  • Did you try implementing Netch's suggestion to get the compiler to do a 64b/32b => 32b division? [Agner Fog's tables](http://agner.org/optimize) say `idiv r32` on AMD Piledriver is still 13-40c latency, rather than 13-71c. But on Skylake, `idiv r32` is 26c latency, **one per 6c throughput**. Haswell has not quite as good throughput, but still partially pipelined. – Peter Cordes Apr 12 '16 at 00:35
  • No I didn't wish to use assembly. – Michel Rouzic Apr 12 '16 at 14:44
1

Your time-hog instruction:

div = ((int64_t) (ya-xa)<<ds) / d;

exposes at least two issues. The first is that the variable name div shadows the standard library's div() function; but this is minor and might never be observed. The second is that, according to C language rules, both operands are first converted to their common type, int64_t, and the division for that type is then expanded into a CPU instruction which divides a 128-bit dividend by a 64-bit divisor(!). An extract from the assembly of a cut-down version of your function:

  21:   48 89 c2                mov    %rax,%rdx
  24:   48 c1 fa 3f             sar    $0x3f,%rdx ## this is sign bit extension
  28:   48 f7 fe                idiv   %rsi

Yes, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. a reciprocal-divisor approach requires a multiplication with a 192-bit product). But if you are sure this division can be done with a 64-bit dividend and a 32-bit divisor and it won't overflow (the quotient will fit into 32 bits; I agree, because ya-xa is always smaller in absolute value than ya+xa), it can be sped up using an explicit assembly request:

uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
    "cc");

This is quick and dirty and should be carefully verified, but I hope the idea is clear. The resulting assembly now looks like:

  18:   48 98                   cltq   
  1a:   48 c1 e0 1d             shl    $0x1d,%rax
  1e:   48 89 c2                mov    %rax,%rdx
  21:   48 c1 ea 20             shr    $0x20,%rdx
  27:   f7 f9                   idiv   %ecx

This is a huge improvement, because a 64/32 division requires up to 25 clock cycles on the Core family according to the Intel optimization manual, instead of the 70 you see for a 128/64 division.

Further minor improvements can be made; e.g. the shifts can be done more economically, in parallel:

uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (lowpart), "d" (highpart), [d] "q" (d) :
    "cc");

which results in:

  18:   89 d0                   mov    %edx,%eax
  1a:   c1 e0 1d                shl    $0x1d,%eax
  1d:   c1 ea 03                shr    $0x3,%edx
  22:   f7 f9                   idiv   %ecx

but this is a minor fix compared to the division-related one.
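For anyone wanting to drop this in, here is a self-contained, compilable sketch of the same 64/32 idivl (hypothetical wrapper name; x86-only, gcc extended asm, and the caller must guarantee the quotient fits in 32 bits or the CPU raises a divide fault):

```c
#include <stdint.h>

/* 64-by-32 signed division via a single x86 idivl. The "0"/"1" matching
   constraints place the dividend in edx:eax; declaring the remainder as an
   output tells gcc that edx is written by the instruction. */
static int32_t div64_32(int64_t num, int32_t den)
{
    int32_t quot, rem;
    __asm__("idivl %[d]"
            : "=a"(quot), "=d"(rem)
            : "0"((uint32_t)(uint64_t)num),
              "1"((uint32_t)((uint64_t)num >> 32)),
              [d] "r"(den)
            : "cc");
    return quot;
}
```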

To conclude, I really doubt this routine is worth implementing in C. The language is quite uneconomical for this kind of integer arithmetic, forcing useless widenings and discarding the high parts of products. The whole routine is worth moving to assembler.

Netch
-1

Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.

Assuming constants defined for pi and pi/2:

int32_t fpatan2( int32_t y, int32_t x )
{
    fixed theta ;

    if( x == 0 )
    {
        theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
    }
    else
    {
        theta = fpatan( y / x ) ;
        if( x < 0 )
        {
            theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
        }
    }

    return theta ;
}

Note that fixed-point library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler; in most cases you can just replace the float or double keyword with fixed. It does not, however, have an atan2() implementation; the code above is adapted from my implementation for that library.

Clifford
  • I think the `y / x` would mean that it's not going to perform any better, because it's still performing an integer division. – Ross Ridge Aug 30 '14 at 18:13
  • Clifford `y / x` isn't going to work in fixed point, just imagine if x isn't dwarfed by y, the result of that division is going to be a single digit integer. Also all your approach does is deal with quadrants, my code already does that in a fast and branchless way :). As for getting the implementation right it isn't so hard, you can test your function billions of times against atan2() which I did (I get a max error of 0.07667 arc seconds with a 4 kB LUT). – Michel Rouzic Aug 30 '14 at 18:19