3

I am working on a gamma function that generates an "S-curve". I need to run it in a real-time environment, so I need to speed it up as much as possible.

The code is as follows:

float Gamma = 2.0f; //Input Variable

float GammaMult = pow(0.5f, 1.0f-Gamma);
if(Input<1.0f && Input>0.0f)
{
    if(Input<0.5f)
    {
        Output = pow(Input,Gamma)*GammaMult;
    }
    else
    {
        Output  = 1.0f-pow(1.0f-Input,Gamma)*GammaMult;
    }
}
else
{
   Output  = Input;
}

Is there any way I can optimize this code?

Humam Helfawi
user2339945

2 Answers

3

You can avoid pipeline stalls by eliminating the branch on Input<1.0f && Input>0.0f, either with saturation arithmetic if the instruction set supports it, or with max/min intrinsics, e.g. x86 MAXSS. Note that saturating clamps out-of-range inputs to [0, 1], whereas the original code passes them through unchanged.

You can also eliminate the other branch by rounding the saturated Input. The full algorithm:

float GammaMult = pow(0.5f, 1.0f-Gamma);
Input = saturate(Input); // saturate via assembly or intrinsics
// Input is now in [0, 1]
float Rounded = round(Input); // 0 or 1; round via assembly or intrinsics
float Coeff = 1.0f - 2.0f * Rounded; // +1 when Rounded is 0, -1 when Rounded is 1
Output = Rounded + Coeff * pow(Rounded + Coeff * Input, Gamma) * GammaMult;

Rounding should be done via asm/intrinsics as well.

If you apply this function to, e.g., successive values of an array, consider vectorising it if the target architecture supports SIMD.

Tamás Zahola
  • What is the advantage of rounding here? The original code doesn't appear to want the result rounded to an integer, and you're not eking out any more performance by avoiding floating-point ops in favor of integer ops as long as you use the `FRNDINT` instruction since that leaves the result on the floating point stack. – Cody Gray - on strike Jan 18 '16 at 13:04
  • @CodyGray rounding is used to generate the coefficients so that he won't need to branch on `Input < 0.5`. E.g.: `Coeff = 1 - 2 * Rounded` will be 1 if `Input < 0.5` and -1 if `Input > 0.5`, thus what was a branch in the original algorithm now becomes one round instruction, one floating-point multiply and one floating-point add, so pipelining won't suffer. – Tamás Zahola Jan 18 '16 at 13:07
  • Oh, of course! Very clever. I had missed that in my cursory reading. – Cody Gray - on strike Jan 18 '16 at 13:09
  • 1
    SIMDed `pow` is not very ubiquitous outside of high end compilers. – Rotem Jan 18 '16 at 13:10
0

Your code seems fine. The bottleneck, if there is one, is the pow function. The only real option is to go a bit deeper into low-level details and implement your own pow function. For example, if two digits of precision are sufficient for you, you may find approximation-based algorithms that are faster.

See this: The most efficient way of implementing pow() function in floating point

Community
Humam Helfawi