3

I've been trying to design a fast binary exponentiation implementation in OpenCL. My current implementation is very similar to the one in this book about pi.

// Returns 16^n mod ak
inline double expm (long n, double ak)
{
    double r = 16.0;
    long nt;

    if (ak == 1) return 0.;
    if (n == 0) return 1;
    if (n == 1) return fmod(16.0, ak);

    for (nt=1; nt <= n; nt <<=1);

    nt >>= 2;

    do
    {
        r = fmod(r*r, ak);
        if ((n & nt) != 0)
            r = fmod(16.0*r, ak);
        nt >>= 1;
    } while (nt != 0);
    return r;
}

Is there room for improvement? Right now my program is spending the vast majority of it's time in this function.

Bruce Dean
  • 2,798
  • 2
  • 18
  • 30
AnimatedRNG
  • 1,859
  • 3
  • 26
  • 39
  • Any idea of what the general range of the input is? – cactus1 Jan 21 '14 at 01:52
  • n can range up to a million or so. So, a fairly large range of input. – AnimatedRNG Jan 21 '14 at 02:53
  • Is `ak` actually double or an integer? What is it's range? – Aki Suihkonen Jan 21 '14 at 07:07
  • Sorry for the delay - I'm back now. ak is (as far as I can tell) an integer. For some reason, this implementation treats it like a double (perhaps to avoid casting at some point?). I haven't actually tested the entire program with ak as an int instead of a double, so I don't know if making such a switch will reduce its accuracy or affect the performance in a later section. – AnimatedRNG Jan 22 '14 at 02:22

2 Answers2

2

My first thought is to vectorize it, for a potential speed up of ~1.6x. This uses 5 multiplies per loop compared to 2 multiplies in the original, but with approximately a quarter the number of loops for sufficiently large N. Converting all the doubles to longs, and swapping out the fmods for %s may provide some speed up depending on the exact GPU used and whatever.

inline double expm(long n, double ak) {

    double4 r = (1.0, 1.0, 1.0, 1.0);
    long4 ns = n & (0x1111111111111111, 0x2222222222222222, 0x4444444444444444,
            0x8888888888888888);
    long nt;

    if(ak == 1) return 0.;

    for(nt=15; nt<n; nt<<=4); //This can probably be vectorized somehow as well.

    do {
        double4 tmp = r*r;
        tmp = tmp*tmp;
        tmp = tmp*tmp;
        r = fmod(tmp*tmp, ak); //Raise it to the 16th power, 
                                       //same as multiplying the exponent 
                                       //(of the result) by 16, same as
                                       //bitshifting the exponent to the right 4 bits.

        r = select(fmod(r*(16.0,256.0,65536.0, 4294967296.0), ak), r, (ns & nt) - 1);
        nt >>= 4;
    } while(nt != 0); //Process n four bits at a time.

    return fmod(r.x*r.y*r.z*r.w, ak); //And then combine all of them.
}

Edit: I'm pretty sure it works now.

cactus1
  • 629
  • 5
  • 8
  • This looks awesome! Unfortunately, I don't have much time today to figure out exactly what your code is doing; I'll probably take a look tomorrow. – AnimatedRNG Jan 22 '14 at 02:27
  • Thanks! I think I have it working now. The gist of it is it uses vector types to process the input 4 bits at a time. – cactus1 Jan 22 '14 at 16:01
  • Hm... it doesn't seem to be working correctly. For example, in one of the first iterations on one of the first work items it has to calculate expm(6, 9) (in other words 16^6 % 9). expm() returns 9 when the answer is 1. – AnimatedRNG Jan 22 '14 at 22:15
  • Looks like I had the select set up backwards and forgot an fmod on the return statement, although I don't see how both of those could cause that error. Any other broken cases to report? – cactus1 Jan 22 '14 at 22:28
  • Ah, spotted it. I had the wrong numbers in the select(..) part of the code, the wrong powers of 16. – cactus1 Jan 22 '14 at 22:32
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/45845/discussion-between-animatedjuzz-and-cactus1) – AnimatedRNG Jan 22 '14 at 22:38
0
  • The loop to extract nt = log2(n); can be replaced by
    if (n & 1) ...; n >>= 1;
    in the do-while loop.
  • Given that initially r = 16;, fmod(r*r, ak) vs fmod(16*r,ak) can be easily delayed to calculate the modulo only every Nth iteration or so -- Loop unrolling?
  • Also why fmod?
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57