
I'm following the example here to create a variable-length local memory array. The kernel signature is something like this:

__kernel void foo(__global float4* ex_buffer,
                 int ex_int,
                 __local void *local_var) 

Then I call clSetKernelArg for the local memory kernel argument as follows:

clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL);

Where MaxSharedMem is set by querying CL_DEVICE_LOCAL_MEM_SIZE. Inside the kernel I then split the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine on AMD (GPU and CPU) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then call clFinish on the queue.
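For context, the host side obtains MaxSharedMem roughly like this (a simplified sketch; `device` is the cl_device_id selected earlier, and error checking is omitted):

cl_ulong local_mem_size = 0;   // CL_DEVICE_LOCAL_MEM_SIZE is reported as a cl_ulong
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_size), &local_mem_size, NULL);
size_t MaxSharedMem = (size_t) local_mem_size;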

This is a simple kernel that generates the mentioned error (local work size is 32):

__kernel 
void s_Kernel(const unsigned int N, __local void *shared_mem_block ) 
{
    const ushort thread_id = get_local_id(0);   
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);


    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}

The kernel runs fine if I allocate the same arrays and data structures in local memory statically (see the sketch below). Could somebody provide an explanation for this behavior, and/or workarounds?
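For reference, the statically allocated variant that runs without error looks roughly like this (the kernel name and sizes here are placeholders, not the exact ones from my real kernel):

__kernel
void s_Kernel_static(const unsigned int N)
{
    const ushort thread_id = get_local_id(0);

    // fixed-size local arrays declared in the kernel body; no __local argument needed
    __local double foo[1000];
    __local ushort bar[32];

    foo[thread_id] = 0.0;
    bar[thread_id] = 0;
}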

tavr
  • What is the actual value of `MaxSharedMem` that you are using/getting? What happens if you reduce the `MaxSharedMem` variable to some smaller value, say 1000? – Robert Crovella Mar 24 '16 at 19:45
  • Host is not C#, is it? In C#, `sizeof(char)` is 2. – huseyin tugrul buyukisik Mar 24 '16 at 22:18
  • Host is C++. `MaxSharedMem` is 48000 for the card I'm testing on, which is a Quadro K1000M. I tried setting it to a smaller value, but this had no effect. The only thing that allowed me to avoid the `CL_INVALID_COMMAND_QUEUE` error was statically allocating all my data structures in local mem, which is hardly workable for my kernel. – tavr Mar 25 '16 at 00:11
  • Apologies, the card is a Quadro 1000M, not a K1000M. – tavr Mar 25 '16 at 04:03
  • Perhaps you should show a short complete example that is not working. [Here](http://pastebin.com/XN6xmRCV) is my sample, modified slightly from the vector add sample [here](https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/) which seems to run correctly on an NVIDIA K40c. – Robert Crovella Mar 25 '16 at 20:32
  • Added a simple kernel that demonstrates the error (on my GPU). – tavr Mar 28 '16 at 01:43
  • I was asking for a complete example, like the working example I provided. Not just the kernel. – Robert Crovella Mar 30 '16 at 02:40
  • @robert host code + kernel [here](http://pastebin.com/AaRSavQ1) – tavr Apr 06 '16 at 15:46
  • If I change `__local void *shared_mem_block` in your kernel definition to `__local double *shared_mem_block` your code runs without error for me. – Robert Crovella Apr 07 '16 at 01:08

1 Answer


For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.

As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to a type that the compiler will align to at least 8 bytes (double, ulong, etc.).

Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.

Once the memory is 8-byte aligned, a cast to a pointer type that assumes a weaker alignment (e.g. ushort) works without issues. So, if you're carving the allocation up into arrays of different types like I am, make sure the kernel parameter is a pointer to the type with the largest alignment requirement.
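For example, applying this to the kernel from my question (the change Robert Crovella suggested in the comments), only the parameter type changes; the rest is a sketch of the same code:

__kernel
void s_Kernel(const unsigned int N, __local double *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);

    // The parameter type now carries the strictest (8-byte) alignment,
    // so casting part of the block to a less strictly aligned type is safe.
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);

    foo[thread_id] = 0.0;
    bar[thread_id] = 0;
}

The host-side clSetKernelArg call stays the same: for a __local argument you still pass only the size in bytes and a NULL pointer.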

tavr
  • Good on you for coming back and sharing details to help others. – Regular Jo Mar 19 '17 at 02:53
  • Thank you, had the same issue, glad you posted the solution here. Very helpful. https://devtalk.nvidia.com/default/topic/1057646/cuda-programming-and-performance/cl_invalid_command_queue-error-due-to-local-memory-byte-alignment/?offset=2#5363027 – FangQ Jul 19 '19 at 18:40