16

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.

But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):

sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )

where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.

Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?

Thank you for your answers.

– Seub

7 Answers

8

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.

In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.

Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:

http://en.wikipedia.org/wiki/Estrin%27s_scheme , http://en.wikipedia.org/wiki/Horner_scheme

The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:

http://en.wikipedia.org/wiki/Minimax_approximation_algorithm , http://en.wikipedia.org/wiki/Remez_algorithm

The identities used in the argument reduction for acos() are noted in the comments; for sin() I used a Cody/Waite-style argument reduction, as described in the following book:

W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980

The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.

/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
  double t = floor (fabs(x) + 0.5);
  return (x < 0.0) ? -t : t;
}

/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
  double x8, x4, x2;
  x2 = x * x;
  x4 = x2 * x2;
  x8 = x4 * x4;
  /* evaluate polynomial using Estrin's scheme */
  return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
         (-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
         (-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}

/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
  double x4, x2, t;
  x2 = x * x;
  x4 = x2 * x2;
  /* evaluate polynomial using a mix of Estrin's and Horner's scheme */
  return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 + 
          (8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}

/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
  double x8, x4, x2;
  x2 = x * x;
  x4 = x2 * x2;
  x8 = x4 * x4;
  /* evaluate polynomial using a mix of Estrin's and Horner's scheme */
  return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
           (2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
          (3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
          (7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x; 
}

/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
  double q, t;
  int quadrant;
  /* Cody-Waite style argument reduction */
  q = my_rint (x * 6.3661977236758138e-1);
  quadrant = (int)q;
  t = x - q * 1.5707963267923333e+00;
  t = t - q * 2.5633441515945189e-12;
  if (quadrant & 1) {
    t = cos_core(t);
  } else {
    t = sin_core(t);
  }
  return (quadrant & 2) ? -t : t;
}

/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
  double xa, t;
  xa = fabs (x);
  /* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2)) 
   * arccos(x) = pi/2 - arcsin(x)
   * arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
   */
  if (xa > 0.5625) {
    t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
  } else {
    t = 1.5707963267948966 - asin_core (xa);
  }
  /* arccos (-x) = pi - arccos(x) */
  return (x < 0.0) ? (3.1415926535897932 - t) : t;
}
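
For the original question, here is a minimal sketch of how these routines might be combined; the function my_expression and its signature are an illustrative addition, not part of the answer above:

/* illustration only: sin( acos(t[0]) + ... + acos(t[n-1]) )
   built from my_acos() and my_sin() above */
double my_expression (const double *t, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += my_acos (t[i]);   /* each t[i] is assumed to lie in [-1, 1] */
  }
  return my_sin (sum);       /* sum <= n*pi, well inside my_sin's range */
}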
– njuffa
4
sin( acos(t1) + acos(t2) + ... + acos(tn) )

boils down to the calculation of

sin( acos(x) ) and cos(acos(x))=x

because

sin(a+b) = cos(a)sin(b)+sin(a)cos(b).

The first thing is

sin( acos(x) )  = sqrt(1-x*x)

Taylor series expansion for the sqrt reduces the problem to polynomial calculations.

To clarify, here's the expansion to n=2, n=3:

sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1)) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1

cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)

sin( acos(t1) + acos(t2) + acos(t3) ) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2) + acos(t3))cos(acos(t1)) =
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2) * t1

and so on.
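
Here is a minimal sketch of this expansion carried out incrementally in code, keeping track of the sine and cosine of the partial sum so that only sqrt() is needed; the function name sin_sum_acos and its signature are illustrative, not part of the answer:

#include <math.h>

/* illustration: sin( acos(t[0]) + ... + acos(t[n-1]) ) using only sqrt(),
   via the angle-addition formulas above */
double sin_sum_acos (const double *t, int n)
{
  double s = 0.0;   /* sin of the partial sum (empty sum = 0) */
  double c = 1.0;   /* cos of the partial sum */
  for (int i = 0; i < n; i++) {
    double ci = t[i];                   /* cos(acos(t_i)) = t_i */
    double si = sqrt (1.0 - ci * ci);   /* sin(acos(t_i)) >= 0, since acos(t_i) is in [0, pi] */
    double s_new = s * ci + c * si;     /* sin(a+b) = sin(a)cos(b) + cos(a)sin(b) */
    c = c * ci - s * si;                /* cos(a+b) = cos(a)cos(b) - sin(a)sin(b) */
    s = s_new;
  }
  return s;
}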

The sqrt() for x in (-1,1) can be computed using

x_0 is some nonzero initial approximation, say 1

x_(n+1) = 0.5 * (x_n + S/x_n)  where S is the argument.

EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).
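
A minimal sketch of that iteration, using a convergence test rather than a fixed iteration count; the function name babylonian_sqrt is illustrative, and the argument S is assumed to be positive:

#include <math.h>

/* Babylonian (Heron's) method for sqrt(S), S > 0 */
double babylonian_sqrt (double S)
{
  double x = 1.0;                           /* nonzero starting guess */
  double prev;
  do {
    prev = x;
    x = 0.5 * (x + S / x);                  /* x_(n+1) = 0.5 * (x_n + S/x_n) */
  } while (fabs (x - prev) > 1e-12 * x);    /* stop once the update is negligible */
  return x;
}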

– Viktor Latypov
  • I am well aware of the maths, I'm not so good at computer performance. "Taylor series expansion for the sqrt reduces the problem to polynomial calculations." no, it reduces it to a Taylor expansion, that's it. A Taylor expansion is a *very* inefficient way to compute a square root. – Seub Jun 29 '12 at 13:27
  • On the other hand, it's probably much quicker to compute a square root than it is to compute an acos using the standard functions of C++. But I'm afraid that the gain I get this way, I'll lose it by having to expand the sine like you explain, involving more computations... I'm not sure. – Seub Jun 29 '12 at 13:29
  • @user1367124: Anyways sqrt is faster and you can calculate it with simple iterations. Fixed my answer a little. – Viktor Latypov Jun 29 '12 at 13:30
  • And a couple of multiplications are definitely faster than evaluation of sin(a+b). – Viktor Latypov Jun 29 '12 at 13:34
  • I wouldn't be so sure. First I don't see the point of saying how to compute a square root, are you suggesting you're doing better than the default sqrt? Second, when you say "a couple", it definitely depends on how many. After all, the computation of sin or acos is also "a couple of multiplications". – Seub Jun 29 '12 at 13:41
  • `SQRTPD` takes up to 32 cycles on Nehalem and up to 21 cycles on Sandy Bridge to compute the square root of two `double` values. `MULPD` takes up to 5 cycles. `DIVPD` takes up to 22 cycles. Why would you ever want to _not_ perform square root computation using dedicated instructions on modern CPUs? – Hristo Iliev Jun 29 '12 at 15:47
  • The original problem was to reduce acos() to sqrt(). I bet that FPU's/SSE sqrt() routine is more than enough in terms of speed and accuracy. So no objections here - just live with standard sqrt(). – Viktor Latypov Jun 29 '12 at 15:53
3

As Jonas Wielicki mentions in the comments, there aren't many precision trade-offs you can make.

Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and to use some math to reduce the number of calculations necessary.

It is also very important to keep everything in a CPU-friendly format and to make sure there are few cache misses, etc.

If you are calculating functions like acos in large quantities, perhaps moving to the GPU is an option for you?
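
As one concrete illustration (assuming an x86 CPU with SSE2; the function two_sqrts and its signature are hypothetical), the square roots that appear once the expression is expanded algebraically can be computed two at a time with the _mm_sqrt_pd intrinsic:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* compute sqrt(1 - t1*t1) and sqrt(1 - t2*t2) with a single SQRTPD */
void two_sqrts (double t1, double t2, double *r1, double *r2)
{
  __m128d t = _mm_set_pd (t2, t1);                 /* pack both inputs */
  __m128d v = _mm_sub_pd (_mm_set1_pd (1.0),
                          _mm_mul_pd (t, t));      /* 1 - t*t, elementwise */
  __m128d s = _mm_sqrt_pd (v);                     /* two square roots at once */
  double out[2];
  _mm_storeu_pd (out, s);                          /* out[0] = result for t1 */
  *r1 = out[0];
  *r2 = out[1];
}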

– orlp
  • 1
    "try and use the processor intrinsics", "keep everything in a CPU-friendly format, make sure there are few cache misses", "moving to the GPU" : unfortunately I have no idea what that means! I should do my homework and look it up. But I'm really not acquainted to scientific programming, performance etc. – Seub Jun 29 '12 at 12:09
  • 1
@user1367124 yes, I think this is the best you can do. Also if you have multiple CPU cores at hand, I'd suggest trying to parallelize the work over multiple threads where possible. If you can easily parallelize your logic to an arbitrary number of threads, you really should go to the GPU. – Jonas Schäfer Jun 29 '12 at 12:29
2

You can try to create lookup tables and use them instead of the standard C++ functions, and see whether you get any performance boost.
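
A minimal sketch of such a table with linear interpolation for acos; the table size and the names init_acos_lut and acos_table are illustrative assumptions, and linear interpolation over a table this small will not reach 1e-10, especially near ±1:

#include <math.h>

#define LUT_SIZE 4096                 /* arbitrary size, for illustration */
static double acos_lut[LUT_SIZE + 1];

/* fill the table once, e.g. at program start */
void init_acos_lut (void)
{
  for (int i = 0; i <= LUT_SIZE; i++) {
    acos_lut[i] = acos (-1.0 + 2.0 * i / LUT_SIZE);
  }
}

/* table lookup with linear interpolation, x in [-1, 1] */
double acos_table (double x)
{
  double pos = (x + 1.0) * 0.5 * LUT_SIZE;
  int idx = (int) pos;
  if (idx >= LUT_SIZE) idx = LUT_SIZE - 1;   /* handles x == 1.0 */
  double frac = pos - idx;
  return acos_lut[idx] + frac * (acos_lut[idx + 1] - acos_lut[idx]);
}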

– BЈовић
In theory this could work, however a lookup table which gives an accuracy of 1e-10 is likely too big for the cache, so the cache misses are likely to destroy the benefits. A smaller lookup table used in conjunction with interpolation might work, but whether or not it is faster than the standard functions I can't say – Grizzly Jun 29 '12 at 12:25
  • I believe that a little formula manipulation and high-order series expansion is a much better thing to do. – Viktor Latypov Jun 29 '12 at 12:46
1

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.

Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.

If you are on a general-purpose CPU, the most you can do is use aligned data and feed the cache lines by prefetching. If you have nested loops, then the innermost loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are reused, etc.
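
A minimal sketch of the back-and-forth traversal; the function sum_passes and its arguments are illustrative placeholders:

/* several passes over the same array, alternating the traversal direction
   so the cache lines touched at the end of one pass are still resident
   at the start of the next */
double sum_passes (const double *data, int n, int passes)
{
  double sum = 0.0;
  for (int p = 0; p < passes; p++) {
    if (p & 1) {
      for (int j = n - 1; j >= 0; j--) sum += data[j];  /* backward pass */
    } else {
      for (int j = 0; j < n; j++)      sum += data[j];  /* forward pass */
    }
  }
  return sum;
}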

You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

– Indy9000
  • thank you, unfortunately, I'm new to scientific programming and performance, and I have no idea how to do that ;) – Seub Jun 29 '12 at 12:10
  • @user - I don't want to be offensive, but if you ask about driving at Formula 1 speed, you might have to take some lessons. Or hire a driver (hint!). – Bo Persson Jun 29 '12 at 12:43
  • You're probably right, I was just wondering if someone knew a faster way or library or whatever to compute the acos, maybe not at max precision... – Seub Jun 29 '12 at 13:39
1

In addition to what others have said, here are some techniques at speed optimization:

Profile

Find out where in the code most of the time is spent. Only optimize that area, to gain the most benefit.

Unroll Loops

The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.

The technique is to place more "sets" of operations in your loop and reduce the number of iterations.

Declare Variables as Register

Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.

Keep Intense Calculations Short & Simple

Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.

Distribute your big calculation loop into many small ones.

Perform Work on Small Sections of Arrays & Matrices

Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".

Think in Parallel Processor Terms

Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.

Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.

Analyze Assembly Listing of Functions

Print out the assembly language listing of your function. Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.

If you really need more efficiency, optimize the assembly language and put it in as inline assembly code or as a separate module. I generally prefer the latter.

Examples

In your situation, take the first terms of the Taylor expansion, calculate them separately, and place them into individual variables (the example below handles four terms per loop iteration):

double result = 0.0;
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (int i = 0; i < 100; ++i)
{
  n1 = n + 2;
  n2 = n + 4;
  n3 = n + 6;
  n4 = n + 8;   /* n4 is the start of the next group of terms */

  term1 = 4.0/n;
  term2 = 4.0/n1;
  term3 = 4.0/n2;
  term4 = 4.0/n3;

Then sum up all of your terms:

  result += term1 - term2 + term3 - term4;
  // Or try sorting by operation, if possible:
  // result += term1 + term3;
  // result -= term2 + term4;

  n = n4;
}
– Thomas Matthews
0

Let's consider two terms first:

cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)

or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))

Applying acos to both sides:

a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1

Here cos(a) = t_1 and cos(b) = t_2, so a = acos(t_1) and b = acos(t_2).

By substituting in equation (1), we get

acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))

Here you can see that two acos terms have been combined into one. So you can pair up all the acos terms recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)), which equals sqrt(1 - x*x).

This should reduce the running time, since every acos and sin call is replaced by square roots and multiplications.

However, I'm not sure about the complexity of calculating sqrt().

– Shashwat