
I'm having a problem with my OpenCL kernel. I am trying to do Runge-Kutta 4 integration. I already implemented it in an OpenGL compute shader, where it works, and now I want to implement it in OpenCL.

I think my issue has to do with not knowing how to correctly share a single instance of a global array across all my function calls. Right now I have to pass the pointers to the arrays as arguments in every function call, and it seems to me that this actually creates a local copy in those functions, because my current implementation works for small data sets but not for big ones (those throw CL_OUT_OF_RESOURCES).

In my compute shader I declare all my global arrays like:

layout(std430, binding=0) buffer pblock { coherent volatile restrict vec4 mcPosition[]; };

layout(std430, binding=1) buffer vblock { coherent volatile restrict vec4 mcVelocity[]; };

And I can use them fine in my functions:

vec4 calculateAcceleration(int numPoints, int step, ...) {...}

void rk4Step(int numPoints, int index, float timeStepToUse, ...) {...}

void calculateError(int index) {...}

But in the OpenCL implementation, the only way I know how to do it is like this (very slimmed-down example):

void rk4Step(
    const __constant int* numPoints,
    const int index,
    const float timeStepToUse,
    const bool calculateHalfTimeStep,
    const __constant float* squaredSofteningFactor,
    const __constant float* gravitationalConstant,
    __global float4* kvel,
    __global float4* dydx,
    __global float4* kpos,
    __global float4* mcPositionHalf,
    __global float4* mcVelocityHalf,
    __global float4* mcPositionFull,
    __global float4* mcVelocityFull
    )
{
    ...

    // Actual time step
    if(!calculateHalfTimeStep)
    {
        mcVelocityFull[index] += (kvel[index] + (2.0f*kvel[index+numPoints[0]]) + (2.0f*kvel[index+numPoints[0]*2]) + kvel[index+numPoints[0]*3]) * (1.0f/6.0f);
        mcPositionFull[index] += (kpos[index] + (2.0f*kpos[index+numPoints[0]]) + (2.0f*kpos[index+numPoints[0]*2]) + kpos[index+numPoints[0]*3]) * (timeStepToUse/6.0f);
    }
    else
    {
        mcVelocityHalf[index] += (kvel[index] + (2.0f*kvel[index+numPoints[0]]) + (2.0f*kvel[index+numPoints[0]*2]) + kvel[index+numPoints[0]*3]) * (1.0f/6.0f);
        mcPositionHalf[index] += (kpos[index] + (2.0f*kpos[index+numPoints[0]]) + (2.0f*kpos[index+numPoints[0]*2]) + kpos[index+numPoints[0]*3]) * (timeStepToUse/6.0f);
    }
}

void calculateError(const int index, const __constant float* accuracy, __global float4* scale, __global float4* deltaPos)
{
    float partialError = 0.0f;
    partialError = fmax(partialError, fabs(deltaPos[index].x / scale[index].x));
}

// Adaptive step 4th order Runge-Kutta
__kernel
void main( const __constant float* timeStep, const __constant float* accuracy, const __constant int* maxSteps,
    __global float4* mcPosition, __global float4* mcVelocity, __global float4* scale)
{
    const int index = get_global_id(0);   // index taken from the global ID; other setup elided

    // Scaling used to monitor accuracy
    scale[index] = calculateAcceleration(bi, index, numPoints, 1, false,
        squaredSofteningFactor, gravitationalConstant,
        mcPositionHalf, mcPositionFull, kvel);

    scale[index] = fabs(mcVelocity[index]) + fabs(scale[index] * timeStep[0]);

    for(int step=1; step<=maxSteps[0]; ++step)
    {
        // Take two half steps
        rk4Step(numPoints, index, timeStep[0], true,
            squaredSofteningFactor, gravitationalConstant,
            mcPosition, mcVelocity);
        rk4Step(numPoints, index, timeStep[0], true,
            squaredSofteningFactor, gravitationalConstant,
            mcPosition, mcVelocity);

        // Take one full step
        timeStep[0] *= 2.0f;
        rk4Step(numPoints, index, timeStep[0], false,
            squaredSofteningFactor, gravitationalConstant,
            mcPosition, mcVelocity);

        // Evaluate accuracy
        calculateError(index, accuracy, scale, deltaPos);
    }
}

As you can see, the difference is that in the compute shader version I can just declare the shared global arrays at the top of the file and use them in any of my functions.

But in the OpenCL kernel version I have to pass those arrays as arguments for every function invocation and for large data sets this gives me a CL_OUT_OF_RESOURCES error.

I think my issue has to do with the fact that, even though I declared the arrays __global, each function invocation tries to make a local copy of them, but maybe I'm wrong. I assume this from reading the documentation, and the same thing is pointed out by this question:

How many copies of a global variable declared inside an opencl kernel function is maintained in the global address space

So my question is: How do I truly share a global array between user defined functions and my OpenCL kernel?

– sleon

1 Answer

The arrays you mention are passed as pointers, so there is no reason to expect a local copy of the whole array. There are also __constant parameters, which rule out writes (and copies) because __constant memory is read-only. The main reason there is no local copy, though, is that GPU OpenCL implementations have no stack; people write fake stacks to get fake recursion, but even those cannot grow beyond a size defined in the host code.
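For example (a minimal sketch with made-up names, not your actual kernel): the helper below receives only a pointer value, and every read or write through it touches the same __global buffer the kernel was given, so nothing is duplicated.

// Sketch (hypothetical names): only the pointer is copied into the helper,
// never the array it points to.
void scaleInPlace(__global float4* data, const int index, const float factor)
{
    data[index] *= factor;               // goes straight to global memory
}

__kernel void scaleKernel(__global float4* data, const float factor)
{
    const int index = get_global_id(0);
    scaleInPlace(data, index, factor);   // passes a pointer value, not the array
}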

When do you get CL_OUT_OF_RESOURCES? After increasing the size of a __constant buffer or of a __global buffer? Generally __constant memory is only 50-100 kB per GPU, while a single __global buffer can be as large as roughly 1/4 of the video memory. The number of __constant parameters is also limited; you could concatenate multiple constant arrays into a single constant array to eliminate that limit. Please query the memory limitations for constant and global memory, starting with clGetDeviceInfo and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.
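A rough host-side sketch of those queries (assuming you already have a valid cl_device_id called device):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: prints the device limits relevant to hunting CL_OUT_OF_RESOURCES. */
void printDeviceLimits(cl_device_id device)
{
    cl_ulong constBufSize = 0, globalMemSize = 0, localMemSize = 0;
    cl_uint  maxConstArgs = 0;
    size_t   maxWorkGroupSize = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(constBufSize),     &constBufSize,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,        sizeof(maxConstArgs),     &maxConstArgs,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,          sizeof(globalMemSize),    &globalMemSize,    NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,           sizeof(localMemSize),     &localMemSize,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,      sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

    printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: %llu\n", (unsigned long long)constBufSize);
    printf("CL_DEVICE_MAX_CONSTANT_ARGS:        %u\n",   maxConstArgs);
    printf("CL_DEVICE_GLOBAL_MEM_SIZE:          %llu\n", (unsigned long long)globalMemSize);
    printf("CL_DEVICE_LOCAL_MEM_SIZE:           %llu\n", (unsigned long long)localMemSize);
    printf("CL_DEVICE_MAX_WORK_GROUP_SIZE:      %zu\n",  maxWorkGroupSize);
}

Note that CL_DEVICE_MAX_CONSTANT_ARGS is a cl_uint and CL_DEVICE_MAX_WORK_GROUP_SIZE is a size_t; reading them into a variable of the wrong type gives nonsense values.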

Other cases:

  • Heap fragmentation: no single big allocation is available, only smaller buffers can be allocated. Are you running concurrent kernels that together use all of the VRAM (or all of the constant memory)?

  • The local work-group size is bigger than the device's maximum (for example, 256 on some AMD GPUs, 1024 on NVIDIA). Is it also a divisor of the global size?

  • Too many scalar registers or too many vector registers are used per thread or per SMX/CU (see the query sketch after this list).
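To check the last two points, a rough host-side sketch (assuming kernel and device handles already exist) using clGetKernelWorkGroupInfo:

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: per-kernel limits the runtime reports for the device you enqueue on. */
void printKernelLimits(cl_kernel kernel, cl_device_id device)
{
    size_t   kernelWgSize = 0;   /* max work-group size for THIS kernel */
    cl_ulong privateMem   = 0;   /* private (register/spill) bytes per work-item */
    cl_ulong localMem     = 0;   /* __local bytes used by this kernel */

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,  sizeof(kernelWgSize), &kernelWgSize, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(privateMem),   &privateMem,   NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,   sizeof(localMem),     &localMem,     NULL);

    printf("CL_KERNEL_WORK_GROUP_SIZE:  %zu\n", kernelWgSize);
    printf("CL_KERNEL_PRIVATE_MEM_SIZE: %llu bytes per work-item\n",  (unsigned long long)privateMem);
    printf("CL_KERNEL_LOCAL_MEM_SIZE:   %llu bytes per work-group\n", (unsigned long long)localMem);
}

If the local size you request is larger than CL_KERNEL_WORK_GROUP_SIZE, the enqueue will fail for that kernel.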

Test:

  • Suppose there are 1024 threads per group.
  • There are at least 7 float4 variables live in the functions.
  • Each float4 is 16 bytes.
  • If each thread holds those variables at the same time (reading them from any source), that is 7 × 16 B = 112 bytes of float4s per thread, not counting the scalar variables you also use.
  • A group of 1024 threads then needs 1024 × 112 B ≈ 112 kB on one SMX, which is more than the 48 kB it has.
  • You could check the real usage with a profiler.

Help:

  • You could change/reorder things in the kernel and functions so that fewer registers are live at any time: declare a variable only just before its use, not at the beginning, and re-use a variable once it has finished its work (for example, instead of introducing v2, v3, v4, keep re-using v1 in their place); see the sketch after this list.
  • Decrease the local work-group size: fewer threads per SMX means less register usage per SMX. Per-thread usage matters too, but mainly for performance.
  • Sometimes a local work-group size as low as 32 (or 64) can be advantageous, despite leaving half of the cores idle, to gain more memory space per thread.
  • Inlining functions can increase register pressure too. Maybe you should decrease the level of unrolling and inlining and try again.
  • Remove the const (not __constant) keyword from the rk4Step parameters; maybe those are pre-allocated in __constant memory space per thread (not impossible).
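As an illustration of the declare-late / re-use idea, a sketch with made-up names (not your actual stage math, which is only a placeholder here):

// Sketch (hypothetical names): the same private accumulators are re-used for
// all four RK4 stages instead of keeping separate k1..k4 values alive at once.
void rk4StepLowPressure(__global float4* pos, __global float4* vel,
                        const int index, const float dt)
{
    float4 accumPos = (float4)(0.0f);
    float4 accumVel = (float4)(0.0f);

    for (int stage = 0; stage < 4; ++stage)
    {
        // k is declared only here and dies at the end of each iteration,
        // so the compiler can keep re-using the same registers for it.
        float4 k = vel[index];                               // placeholder for the real stage math
        float  w = (stage == 1 || stage == 2) ? 2.0f : 1.0f;
        accumVel += w * k;
        accumPos += w * k;                                   // placeholder
    }

    vel[index] += accumVel * (1.0f / 6.0f);
    pos[index] += accumPos * (dt / 6.0f);
}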
– huseyin tugrul buyukisik
  • I get the CL_OUT_OF_RESOURCES error when the __global arrays change to a bigger size. For example it works with 128, 1024 and 8096 bodies, but not with 65536 or 81920 (those are all the sizes of the data sets I'm testing). – sleon Apr 22 '16 at 16:12
  • My queries resulted in: CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 65536, CL_DEVICE_LOCAL_MEM_SIZE: 49152, CL_DEVICE_MAX_CONSTANT_ARGS: 140733193388041 – sleon Apr 22 '16 at 16:20
  • Could you check device max local work group size and what you give? Also register usage per thread. – huseyin tugrul buyukisik Apr 22 '16 at 16:29
  • Yes, it's 1024. And that is the maximum size I am dispatching as well. How do I query the register usage per thread? – sleon Apr 22 '16 at 16:35
  • I know only profilers could tell, because that changes at runtime, such as CodeXL for AMD, but I don't know the equivalent for NVIDIA. Try decreasing the local size to 256 or 192; this should decrease register spillage per SMX, especially if it is an older card, because those have less local memory/registers per SMX. – huseyin tugrul buyukisik Apr 22 '16 at 16:39
  • That is a good idea, I have an older card (nVidia 650M) and I tried several different local sizes like 32, 64, 128, 192, 256, 512 and all gave me the same CL_OUT_OF_RESOURCES error. I have nVidia visual profiler but I don't know how to use it, I guess I'm gonna have to learn. – sleon Apr 22 '16 at 17:02
  • Then it is probably a per-thread limitation. Try to decrease the use of float4s. – huseyin tugrul buyukisik Apr 22 '16 at 17:05
  • Did you check CL_DEVICE_GLOBAL_MEM_SIZE? This is the limitation of your global memory usage. – Robert Wang Apr 22 '16 at 20:47
  • I checked and CL_DEVICE_GLOBAL_MEM_SIZE is 2147287040. I'm going to try huseyin's suggestions to solve this issue. – sleon Apr 22 '16 at 22:09
  • Huseyin, I have a little follow up question I hope you can answer: you said that each smx needs 112 kB which is more than it has (48 kB). Since I declare the arrays as __global, why would each smx use the 112 kB (or whatever size it's using counting my scalars) if the array is supposed to be in global memory? – sleon May 05 '16 at 16:32
  • As function parameters, pointers point to global space (or any other space), but the pointers themselves live in private space. Also, for every calculation there are unnamed private registers used before the result is written to memory. – huseyin tugrul buyukisik May 05 '16 at 17:27