Fastest pow() replacement via modified exp. by squaring when lower powers are already calculated

Question

EDIT:

Goal :
Generate a ubiquitous method for deriving a custom power function that outperforms the built-in pow(double, uint) by reusing precalculated/cached powers from power calculations on common variables.

What's already been done:
I've already derived such a function that's roughly 40% faster than the built-in, however this is a brute-force hand-derived function -- I want a method for autogenerating such a power function block for an arbitrary uint power.

KNOWNS

To derive an optimal custom pow(double, uint) you need some knowns. For this question the knowns (to clarify) are:

The power will be an integer.
The maximum the power can be is known (N_MAX).
The precalculated powers that can be (re)used are known at compile time (e.g. in my example r2, r4, and r6).
The square r2 can be assumed to always have been calculated regardless of the other precalculated powers.

SOLUTION REQUIREMENTS

An optimal solution requiring a separate program to write a case lookup table or preprocessor logic to generate such a table is acceptable, however, non-optimal solutions using hand-generated (i.e. brute force derived) lookup tables using the powers on hand will not be accepted (as I have that already, and show that in my example... the idea is to get away from this).

POSSIBLE SOLUTION ROUTE

As a suggestion, you know N_MAX and a set of powers that are precalculated B (B={2,4,6} for my example). You can produce either in a separate program or in the preprocessor a table of all squares of Sq(Bi, x) <= N_MAX. You can use this to form a basis setA, which you then search somehow to determine the least number of terms that can be summed to produce an arbitrary exponent ofn>>1, wheren<=N_MAX` (the shift is due to that we take care of the odd case by checking the LSB and multiplying by the sqrt(r2)).

THEORETICAL BACKGROUND

I believe formally the below method is a modified version of exponentations by squaring:

http://en.wikipedia.org/wiki/Exponentiation_by_squaring

....which takes advantage of the fact that certain lower order powers are already by necessity precalculated, hence it shifts the optimal set of multiplications from a vanilla exponentation by squaring (which I assume pow(double, int) uses).

However there are significant savings by using the stored small power intermediates instead of simple exp. by squares on the r2.

THEORETICAL PERFORMANCE

For example, for one set of objects n=14.... in this scenario exp. by powers gives

double r4 = Sq(r2), r14=Sq(r4)*r4*r2; //4 op.

... which takes 4 FP multiplications..... but using the r2 and r6 we have

double r14=Sq(r6)*r2; //2 op.

.... 2 FP multiplications.... in other words, by going from "dumb" exponentation by squares to my modified exp. by squares using the common exponent precaching, I've cut my cost of calculations for 50% in terms of multiplications ... at least until memory costs are considered.

REAL PERFORMANCE

With my current method (compiled with gcc -O3) I get 35.1 sec. to run 1 million cycles of my program, versus (w/ no other modifications) 56.6 s using the built int pow(double, int).... so almost the theoretical speedup.

At this point you may be scratching your head at how a 50% cut in multiplications on a single instruction line can deliver a ~40% speedup. But basically this line of code is called 1,000+ times per cycle and is by far the most evaluated/most expensive line of code in the entire program. Hence the program appears highly sensitive to a small optimization/improvement in this chunk.

ORIGINAL POST and EXAMPLE CODE

I need to replace the pow(double, int) function as I already have calculated a 6th power term and have 2nd, 4th power intermediates saved, all of which can be used to reduce multiplications in the second pow call, which uses the same double base.

More specifically, in my c++ code I have a performance critical calculation snippet of code where I raise the reciprocal of the distance between 3D points to the 6th power and nth power. e.g.:

double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(sqrt(r2), n) - r6);

Where m and a are constants related to the fitted equation and n is the arbitrary power.

A slightly more efficient form is:

double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(r2, n)*(n&0x1?sqrt(r2):1.0) - r6);

However, this is also not optimal. What I've found to be significantly faster is to have a custom pow function that uses the multiples r2, r4, and r6, which I have to calculate already anyways for the second term.

e.g.:

double distSq = CalcDist(p1,p2), r2 = a/distSq, r4 = r2 * r2, r6 = r4 * r2;
results += m*(POW(r2, r4, r6 n) - r6);

Inside the function:

double POW(double r2, double r4, double r6, uint n)
{
   double results = (n&0x1 : sqrt(r2) : 1.0);
   n >>= 1;
   switch (n)
   {
     case 1:
     ....
     case 12:
        Sq(Sq(r6));

   }
   return result;
}

The good thing is that my function appears fast in preliminary testing. The bad news is that it's not very ubiquitous and is very long as I need case statements for int powers from 8 to 50 or so (potentially even higher in the future). Further each case I had to examine and try different combinations to find by brute force derivation which combination of r2, r4, and r6 yielded the least multiplications

Does anyone have a more ubiquitous solution for a pow(double, int) replacement that uses precalculated powers of the base to cut the number of necessary multiplications, and/or have a a ubiquitous theory of how you can determine the ideal combination to produce the least multiplications for an arbitrary n and some set of precalculated multiples??

@KerrekSB Is that standard terminology? Neither me nor google seem to have heard about it before. I suppose you are referring to [exponentiation by squaring](http://en.wikipedia.org/wiki/Exponentiation_by_squaring) ? — us2012, Sep 22 '13 at 15:21
Is it? I'm unfamiliar with that method -- I did a quick Google search and didn't find anything. The key things here is that some lower level powers are already known (and moreover mandatory) and can be used to accelerate the hunt for higher order multiples. This is a little different from the situation in which nothing is precalculated, I believe. — Jason R. Mick, Sep 22 '13 at 15:21
Thanks us2012, yep, it's a modified spin on `int` exponentation by squaring. The key modification is that you have three power bases -- (x^2), (x^4), and (x^6) that are precalculated and can be combined to produce the fastest result. — Jason R. Mick, Sep 22 '13 at 15:22
What is maximum value of n? And how many times this pow function will be called? — Rafi Kamal, Sep 22 '13 at 15:24
`n<100` in all cases; in my current models `n=[10,44]` I've covered `n=[2,50]` in my case statement. — Jason R. Mick, Sep 22 '13 at 15:25
So basically, given x^2, x^4 and x^6, you have to calculate x^n (n < 100) with minimum number of computations? — Rafi Kamal, Sep 22 '13 at 15:31
Yes, that is correct Rafi, @Brian, I will review the section in Knuth, thanks. — Jason R. Mick, Sep 22 '13 at 15:32
@JasonR.Mick I think exponentiation by squaring is the fastest method... you will need 12 multiplication in the worst case (for 63) — Rafi Kamal, Sep 22 '13 at 16:01
Yes, but the issue is complicated as you can do squaring of the intermediate lower powers you have on hand from the previous function. e.g. normally you'd do `double r4=Sq(Sq(r)); r12=Sq(r4)*r4;`, but in this case you already have r4... or a more complicated example is `double r4=Sq(Sq(r)), r20; r20=Sq(Sq(r4))*r4;//5 ops` but we can do it in `r20=Sq(r6*r4); //3 ops` given that we've already calculated stuff. — Jason R. Mick, Sep 22 '13 at 17:10

score 1 · Answer 1 · answered Sep 22 '13 at 16:37

1

Here's a somewhat DP-like algorithm that will give you the minimum number of multiplications for a given n and available powers x^i, as well as the optimal strategies via backtracking. To each possible exponent n, associate a pair (minimum number of multiplications to get here, type of multiplication that gets you there) where for the second number, simply write i or a special symbol S for squaring.

You obviously start at 1 -> (0, /).

Given n -> (m_n, Action_m), set n+i -> to (m_n + 1, i) if m_n + 1 is smaller than a possibly previously computed minimum number of moves to n+i. Similarly, set 2n -> (m_n + 1, S)if this is better than a possible previous solution.

This algorithm gives you optimal strategies in roughly O(n_max * #available powers) . I don't claim that the algorithm itself is optimally efficient though, it certainly makes no sense to use this 'on the fly'. It's only useful if you have a reasonable n_max (100, in your case, is certainly okay) and an efficient way to store the strategies.

Two thoughts to consider:

(1) Until this is benchmarked, I'm not convinced it will result in a great performance improvement over standard exp by squaring (heavily dependent on the available powers, of course).

(2) The numerical error behaviour of such strategies (as well as exp by squaring) is completely different from pow(double, double).

answered Sep 22 '13 at 16:37

us2012

16,083
3
46
62

To answer #1, I added benchmark info... it's nearly 40% faster @ runtime, via a 50% theoretical reduction to multiplications.... again this is the most call piece of code... called probably 1,000+ times per cycle, so a small reduction to the # of mult. goes a long ways.... Can you add actual code to implement your algorithm... I'm a bit confused @ how to translate your suggestion to code that would be shorter than my previous brute force derivation... want to see something simple and ubiquitous... if you can give that, I will check your answer. But thanks for the feedback so far.... :) – Jason R. Mick Sep 23 '13 at 04:29
@Jason I'm suddenly not so sure I understand your requirements any more. I don't think there is something "simple **and** ubiquitous". You can choose one: Simple - exp by squaring or cmath's `pow`, or ubiquitous, precompute strategies and hard-code them (as in your solution so far) or dynamically evaluate/interpret them. – us2012 Sep 23 '13 at 10:59
The algorithm outlined in my answer will give you either (1) a way to find more best strategies to hard-code if your `n` gets larger or you have more potential powers available or (2) in the case that the set of available powers might change every few seconds or every program invocation (so that you can't hard-code them), a way to dynamically store strategies. The dynamic evaluation will likely incur a performance penalty due to a higher branch frequency though. It certainly won't give you shorter code. – us2012 Sep 23 '13 at 11:03
No I'm pretty sure you could. You start with your set of known powers `[x^2, x^a,...x^n]` now expand it to be within your range `[2,Mx]`-->`[1,Mx>>1]`... i.e. `A=[x^2...x^n,Sq(x^2,1), Sq(x^2,2).... Sq(x^n,p)]` such that `Ai<=Mx` for all members of the set. Now search possible sums in the set that yield an exponent, and then search for the one with the least terms. This (in theory) could be done in the preprocessor with some fancy macro magic if you set `#define Mx 100` as the basis... – Jason R. Mick Sep 23 '13 at 12:46
Even if you couldn't directly write it, you could write a script or secondary `C` program to output the necessary case state with the minimum number of terms in `A` which when they have the exponents summed produce the desired exponent. – Jason R. Mick Sep 23 '13 at 12:50
@Jason I'll think about that for a bit. Maybe someone with a better idea will come along in the meantime? Since you're talking about preprocessor magic, can we safely assume that both the set of available powers **and** the maximum power you'd ever need are available at compile time rather than runtime? – us2012 Sep 23 '13 at 14:17
Yes, that is correct... let me put that in the question definition. :) Thanks! – Jason R. Mick Sep 23 '13 at 17:41

Fastest pow() replacement via modified exp. by squaring when lower powers are already calculated

1 Answers1