EDIT:
Goal:
Find a general method for deriving a custom power function that outperforms the built-in pow(double, uint) by reusing precalculated/cached powers from power calculations on common variables.
What's already been done:
I've already derived such a function that's roughly 40% faster than the built-in. However, it is a brute-force, hand-derived function; I want a method for autogenerating such a power function block for an arbitrary uint power.
KNOWNS
To derive an optimal custom pow(double, uint) you need some knowns. For this question the knowns (to clarify) are:
- The power will be an integer.
- The maximum the power can be is known (N_MAX).
- The precalculated powers that can be (re)used are known at compile time (e.g. in my example r2, r4, and r6).
- The square r2 can be assumed to always have been calculated regardless of the other precalculated powers.
SOLUTION REQUIREMENTS
An optimal solution requiring a separate program to write a case lookup table, or preprocessor logic to generate such a table, is acceptable. However, non-optimal solutions using hand-generated (i.e. brute-force derived) lookup tables from the powers on hand will not be accepted (I have that already and show it in my example; the idea is to get away from this).
POSSIBLE SOLUTION ROUTE
As a suggestion: you know N_MAX and a set of precalculated powers B (B={2,4,6} for my example). You can produce, either in a separate program or in the preprocessor, a table of all squares Sq(Bi, x) <= N_MAX. You can use this to form a basis set A, which you then search somehow to determine the least number of terms that can be summed to produce an arbitrary exponent n>>1, where n<=N_MAX (the shift is because the odd case is handled by checking the LSB and multiplying by sqrt(r2)).
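One way to make the "search somehow" step concrete is an iterative-deepening search over addition chains that start from the cached exponents. The sketch below is not from my existing code: the names minMults, reach, and contains are hypothetical, exponents are written as powers of r (so the cache {r2, r4, r6} is {2, 4, 6}), and the target is assumed to be the even part of n, so that it is reachable from the cache. The search is exhaustive, so it is meant for offline table generation, not for the hot loop.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: is exponent x already available?
static bool contains(const std::vector<unsigned>& v, unsigned x) {
    return std::find(v.begin(), v.end(), x) != v.end();
}

// Depth-limited search: can `target` be produced from the exponents in `have`
// with at most `budget` more multiplications? One multiplication of two known
// powers adds their exponents, so each step appends have[i] + have[j].
static bool reach(std::vector<unsigned>& have, unsigned target, unsigned budget) {
    if (contains(have, target)) return true;
    if (budget == 0) return false;
    const std::size_t count = have.size();
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t j = i; j < count; ++j) {
            const unsigned sum = have[i] + have[j];
            // Exponents only grow, so overshooting or repeating is never useful.
            if (sum > target || contains(have, sum)) continue;
            have.push_back(sum);
            if (reach(have, target, budget - 1)) return true;
            have.pop_back();
        }
    }
    return false;
}

// Minimum number of multiplications needed to reach `target` from the cached
// exponents, or UINT_MAX if it cannot be reached (e.g. odd target, even cache).
unsigned minMults(const std::vector<unsigned>& cached, unsigned target) {
    for (unsigned budget = 0; budget <= target; ++budget) {
        std::vector<unsigned> have = cached;
        if (reach(have, target, budget)) return budget;
    }
    return UINT_MAX;
}
```

For instance, minMults({2, 4, 6}, 14) evaluates to 2, while minMults({2}, 14) evaluates to 4. When reach succeeds, the exponents appended to have are the chain itself, so a generator program could print one case statement per target from them.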
THEORETICAL BACKGROUND
I believe the method below is formally a modified version of exponentiation by squaring:
http://en.wikipedia.org/wiki/Exponentiation_by_squaring
It takes advantage of the fact that certain lower-order powers are already precalculated by necessity, which shifts the optimal set of multiplications away from vanilla exponentiation by squaring (which I assume pow(double, int) uses).
There are significant savings from using the stored small-power intermediates instead of simple exp. by squaring on r2 alone.
THEORETICAL PERFORMANCE
For example, for one set of objects n=14. In this scenario exp. by squaring gives
double r4 = Sq(r2), r14=Sq(r4)*r4*r2; //4 op.
which takes 4 FP multiplications, but using r2 and r6 we have
double r14=Sq(r6)*r2; //2 op.
which takes 2 FP multiplications. In other words, by going from "dumb" exponentiation by squaring to my modified exp. by squaring using the common exponent precaching, I've cut my cost of calculations by 50% in terms of multiplications, at least until memory costs are considered.
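The two evaluation orders above can be put side by side to confirm that they produce the same value. This is only a small sketch: r14Plain and r14Cached are hypothetical names, and Sq is assumed to be a plain squaring helper.

```cpp
// Assumed helper: squares its argument (1 multiplication).
inline double Sq(double x) { return x * x; }

// Plain exponentiation by squaring starting from r2 only: 4 multiplications.
double r14Plain(double r2) {
    double r4 = Sq(r2);        // 1: r^4
    return Sq(r4) * r4 * r2;   // 3 more: r^8 * r^4 * r^2 = r^14
}

// Reusing the cached r6: 2 multiplications.
double r14Cached(double r2, double r6) {
    return Sq(r6) * r2;        // r^12 * r^2 = r^14
}
```

With r = 2 (so r2 = 4.0 and r6 = 64.0), both functions return 16384.0, which is 2^14.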
REAL PERFORMANCE
With my current method (compiled with gcc -O3) I get 35.1 s to run 1 million cycles of my program, versus (with no other modifications) 56.6 s using the built-in pow(double, int), so almost the theoretical speedup.
At this point you may be scratching your head at how a 50% cut in multiplications on a single instruction line can deliver a ~40% speedup. But basically this line of code is called 1,000+ times per cycle and is by far the most evaluated/most expensive line of code in the entire program. Hence the program appears highly sensitive to a small optimization/improvement in this chunk.
ORIGINAL POST and EXAMPLE CODE
I need to replace the pow(double, int) function as I have already calculated a 6th power term and have 2nd and 4th power intermediates saved, all of which can be used to reduce multiplications in the second pow call, which uses the same double base.
More specifically, in my C++ code I have a performance-critical snippet where I raise the reciprocal of the distance between 3D points to the 6th power and to the nth power, e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(sqrt(r2), n) - r6);
Where m and a are constants related to the fitted equation and n is the arbitrary power.
A slightly more efficient form is:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(r2, n>>1)*(n&0x1?sqrt(r2):1.0) - r6); // r^n = (r^2)^(n>>1) * (n odd ? r : 1)
However, this is also not optimal. What I've found to be significantly faster is to have a custom pow function that uses the powers r2, r4, and r6, which I have to calculate anyway for the second term.
e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r4 = r2 * r2, r6 = r4 * r2;
results += m*(POW(r2, r4, r6, n) - r6);
Inside the function:
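double POW(double r2, double r4, double r6, uint n)
{
    // Odd n: start from r = sqrt(r2); the remaining exponent is even.
    double result = (n&0x1 ? sqrt(r2) : 1.0);
    n >>= 1; // n now counts powers of r2
    switch (n)
    {
    case 1:
    ....
    case 12:
        result *= Sq(Sq(r6)); // (r^6)^4 = r^24 from two squarings
        break;
    }
    return result;
}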
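For comparison, a table-free fallback can be written by peeling the exponent apart into a multiple of 6 plus a remainder of 0, 2, or 4. This is only a heuristic sketch and is not guaranteed to give the minimum number of multiplications for every n, so it does not replace an optimal generated table. The name powCached is hypothetical; the parameters mirror POW above.

```cpp
#include <cmath>

// Heuristic sketch: r^n = r^(n&1) * r^rem * (r^6)^q, where the even part of n
// is 6*q + rem with rem in {0, 2, 4}. The remainder comes straight from the
// cached r2 or r4, and (r^6)^q is built by squaring r6.
double powCached(double r2, double r4, double r6, unsigned n) {
    // Odd n: start from r = sqrt(r2), exactly as in POW.
    double result = (n & 1u) ? std::sqrt(r2) : 1.0;
    const unsigned even = n & ~1u;  // even part of the exponent
    unsigned q = even / 6;
    const unsigned rem = even % 6;
    if (rem == 2) result *= r2;
    else if (rem == 4) result *= r4;
    // Binary exponentiation of r6 to the q-th power.
    double base = r6;
    while (q != 0) {
        if (q & 1u) result *= base;
        q >>= 1;
        if (q != 0) base *= base;
    }
    return result;
}
```

For n=14 this evaluates Sq(r6)*r2 and for n=24 it evaluates Sq(Sq(r6)), matching the hand-derived cases, but it also spends multiplications on the initial 1.0, which a generated switch avoids.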
The good news is that my function appears fast in preliminary testing. The bad news is that it's not very general and is very long, as I need case statements for int powers from 8 to 50 or so (potentially even higher in the future). Further, for each case I had to examine and try different combinations to find, by brute-force derivation, which combination of r2, r4, and r6 yielded the fewest multiplications.
Does anyone have a more general solution for a pow(double, int) replacement that uses precalculated powers of the base to cut the number of necessary multiplications, and/or a general theory of how to determine the ideal combination that produces the fewest multiplications for an arbitrary n and some set of precalculated powers?