I'm having a problem with my OpenCL kernel. I'm trying to do fourth-order Runge-Kutta (RK4) integration. I already implemented it in an OpenGL compute shader, where it works, and now I want to implement it in OpenCL.
I think my issue comes from not knowing how to correctly share a single instance of a global array across all my function calls. Right now I have to pass pointers to the arrays as arguments to every function call, and it seems to me that this creates a local copy in each function: my current implementation works for small data sets, but large ones fail with CL_OUT_OF_RESOURCES.
In my compute shader I declare all my global arrays like this:
layout(std430, binding=0) buffer pblock { coherent volatile restrict vec4 mcPosition[]; };
layout(std430, binding=1) buffer vblock { coherent volatile restrict vec4 mcVelocity[]; };
And I can use them fine in my functions:
vec4 calculateAcceleration(int numPoints, int step, ...) {...}
void rk4Step(int numPoints, int index, float timeStepToUse, ...) {...}
void calculateError(int index) {...}
But in the OpenCL implementation the only way I know how to do it is like this (very slimmed-down example):
void rk4Step(
    const __constant int* numPoints,
    const int index,
    const float timeStepToUse,
    const bool calculateHalfTimeStep,
    const __constant float* squaredSofteningFactor,
    const __constant float* gravitationalConstant,
    __global float4* kvel,
    __global float4* dydx,
    __global float4* kpos,
    __global float4* mcPositionHalf,
    __global float4* mcVelocityHalf,
    __global float4* mcPositionFull,
    __global float4* mcVelocityFull
)
{
    ...
    // Actual time step
    if(!calculateHalfTimeStep)
    {
        mcVelocityFull[index] += (kvel[index] + (2.0f*kvel[index+numPoints[0]]) + (2.0f*kvel[index+numPoints[0]*2]) + kvel[index+numPoints[0]*3]) * (1.0f/6.0f);
        mcPositionFull[index] += (kpos[index] + (2.0f*kpos[index+numPoints[0]]) + (2.0f*kpos[index+numPoints[0]*2]) + kpos[index+numPoints[0]*3]) * (timeStepToUse/6.0f);
    }
    else
    {
        mcVelocityHalf[index] += (kvel[index] + (2.0f*kvel[index+numPoints[0]]) + (2.0f*kvel[index+numPoints[0]*2]) + kvel[index+numPoints[0]*3]) * (1.0f/6.0f);
        mcPositionHalf[index] += (kpos[index] + (2.0f*kpos[index+numPoints[0]]) + (2.0f*kpos[index+numPoints[0]*2]) + kpos[index+numPoints[0]*3]) * (timeStepToUse/6.0f);
    }
}
void calculateError(const int index, __global float4* scale)
{
    float partialError = 0.0f;
    partialError = fmax(partialError, fabs(deltaPos[index].x / scale[index].x));
}
// Adaptive step 4th order Runge-Kutta
__kernel
void main( const __constant float* timeStep, const __constant float* accuracy, const __constant int* maxSteps,
           __global float4* mcPosition, __global float4* mcVelocity, __global float4* scale)
{
    // Scaling used to monitor accuracy
    scale[index] = calculateAcceleration(bi, index, numPoints, 1, false,
                                         squaredSofteningFactor, gravitationalConstant,
                                         mcPositionHalf, mcPositionFull, kvel);
    scale[index] = fabs(mcVelocity[index]) + fabs(scale[index] * timeStep[0]);
    for(int step=1; step<=maxSteps[0]; ++step)
    {
        // Take two half steps
        rk4Step(numPoints, index, timeStep[0], true,
                squaredSofteningFactor, gravitationalConstant,
                mcPosition, mcVelocity);
        rk4Step(numPoints, index, timeStep[0], true,
                squaredSofteningFactor, gravitationalConstant,
                mcPosition, mcVelocity);
        // Take one full step
        timeStep[0] *= 2.0f;
        rk4Step(numPoints, index, timeStep[0], false,
                squaredSofteningFactor, gravitationalConstant,
                mcPosition, mcVelocity);
        // Evaluate accuracy
        calculateError(index, accuracy, scale, deltaPos);
    }
}
As you can see, the difference is that in the compute shader version I can just declare the shared global arrays at the top of the file and use them in any of my functions.
In the OpenCL kernel version, however, I have to pass those arrays as arguments to every function invocation, and for large data sets this gives me a CL_OUT_OF_RESOURCES error.
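To make the pattern concrete, here is a minimal sketch reduced to a single float buffer. The names scaleInPlace, stepOne and stepAll are hypothetical and not from my real code. The `#ifndef __OPENCL_VERSION__` shim is also just an assumption for illustration: it blanks out the `global` and `kernel` qualifiers so the same helpers can be compiled as plain C on the host.

```c
/* Hypothetical shim: under an OpenCL C compiler __OPENCL_VERSION__ is
   predefined and these macros are skipped; on the host they blank out
   the qualifiers so the helpers compile as ordinary C. */
#ifndef __OPENCL_VERSION__
#define global
#define kernel
#endif

/* Helper receives the buffer as a pointer argument, the same way
   rk4Step receives kvel, kpos and the other buffers. */
void scaleInPlace(global float* data, const int index, const float factor)
{
    data[index] *= factor;
}

/* Per-work-item body: forwards the same pointer to the helper several
   times, like the repeated rk4Step calls in the kernel loop. */
void stepOne(global float* data, const int index)
{
    scaleInPlace(data, index, 2.0f);
    scaleInPlace(data, index, 0.5f);
    scaleInPlace(data, index, 3.0f);
}

#ifdef __OPENCL_VERSION__
/* Kernel entry point, only visible to an OpenCL C compiler. */
kernel void stepAll(global float* data)
{
    stepOne(data, (int)get_global_id(0));
}
#endif
```

Compiled as plain C, calling stepOne on one element changes only that element of the caller's array, so on the host at least the helper writes through the pointer rather than into a copy. That does not by itself tell me what the OpenCL compiler does with the same code.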
I think the problem is that even though I declared the arrays as global, each function invocation tries to make a local copy of them, but maybe I'm wrong. I assume this from reading the documentation, and the same thing is pointed out by this question:
So my question is: how do I truly share a global array between user-defined functions and my OpenCL kernel?