When parallelising an integrator using OpenCL - is it bad practice to have the whole loop in the kernel?

I'm attempting to move an RK4 integrator I've written in C++ into OpenCL so I can run the operations on a GPU - currently it uses OpenMP.

I need to run 10 million+ independent integration runs, with about 700 loop iterations for each run. I currently have the loop written into the kernel with a stop condition, but it's not performing as well as I'd have expected.

Current CL Kernel snippet:

```
while (inPos.z > -1.0f){
        cnt++;
        //Eval 1

        //Euler Velocity
        vel1 = inVel + (inAcc * 0.0f);
        //Euler Position
        pos1 = inPos + (vel1 * 0.0f) + ((inAcc * 0.0f)*0.5f);

        //Drag and accels
        combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));
        //motionUtils::drag(netForce, combVel, mortSigma, outPos.z);
        dragForce = mortSigma*1.225f*pow(combVel, 2);
        //Normalise vector
        normVel = vel1 / combVel;
        //Drag Components
        drag = (normVel * dragForce)*-1.0f;
        //Add Gravity force
        drag.z+=((mortMass*9.801f)*-1.0f);
        //Acceleration components
        acc1 = drag/mortMass;

        ...

        //Taylor Expansion
        tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
        inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
        tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);

        //Swap ready for next iteration
        inPos = inPos + (tayVel * timeStep);
        inVel = inVel + (inAcc * timeStep);
```

Any thoughts or suggestions would be much appreciated.

Dusted

1 Answer

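Try faster (but less precise) versions of the slow functions, and replace pow(x, 2) with a plain multiplication. Change

    combVel = sqrt(pow(vel1.x, 2)+pow(vel1.y, 2)+pow(vel1.z, 2));

to

    velSq = vel1.x*vel1.x+vel1.y*vel1.y+vel1.z*vel1.z;
    invVel = native_rsqrt(velSq);

Here velSq and invVel are new float variables introduced for this rewrite. native_rsqrt returns the reciprocal of the square root, so invVel holds 1/combVel and the division

    normVel = vel1 / combVel;

becomes a multiplication:

    normVel = vel1 * invVel;

The squared speed is already available, so

    dragForce = mortSigma*1.225f*pow(combVel, 2);

becomes

    dragForce = mortSigma*1.225f*velSq;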

    drag = (normVel * dragForce)*-1.0f;
    //Add Gravity force
    drag.z+=((mortMass*9.801f)*-1.0f);

to

    drag = -normVel * dragForce;
    //Add Gravity force
    drag.z-=mortMass*9.801f;

    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (1.0f/6.0f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (1.0f/6.0f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (1.0f/6.0f);

to

    tayVel = (vel1+((vel2+vel3)*2.0f)+vel4) * (0.16666667f);
    inAcc = (acc1+((acc2+acc3)*2.0f)+acc4) * (0.16666667f);
    tayPos = (pos1+((pos2+pos3)*2.0f)+pos4) * (0.16666667f);

If the kernel uses too many variables, try decreasing the local work-group size from 256 to 128 or 64. If the variables are not used outside the loop, declare them inside the loop so that more threads can be issued at the same time.
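
When experimenting with the local work-group size, keep in mind that under OpenCL 1.x the global size passed to `clEnqueueNDRangeKernel` must be a multiple of the local size. The sketch below shows only the padding arithmetic on the host. The helper name `round_up_global_size` is hypothetical, and it assumes the kernel skips the padded work-items with a bounds check on `get_global_id(0)`.

```c
#include <stddef.h>

/* Round the number of work-items up to the next multiple of the local size,
   so the result can be used as the global size for a 1-D NDRange. */
size_t round_up_global_size(size_t work_items, size_t local_size)
{
    size_t remainder = work_items % local_size;
    return remainder == 0 ? work_items : work_items + (local_size - remainder);
}
```

For exactly 10,000,000 runs, local sizes of 64 and 128 divide evenly and need no padding, while a local size of 256 pads the global size to 10,000,128.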

huseyin tugrul buyukisik