
If I use a barrier (no matter whether CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE) in my kernel, it causes a CL_INVALID_WORK_GROUP_SIZE error. The global work size is 512, the local work size is 128, 65536 items have to be computed, the maximum work-group size of my device is 1024, and I am using only one dimension. For the Java bindings I use JOCL. The kernel is very simple:

kernel void sum(global float *input, global float *output, const int numElements, local float *localCopy)
{
    localCopy[get_local_id(0)] = input[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE); // or barrier(CLK_GLOBAL_MEM_FENCE)
}

I run the kernel on an Intel(R) Xeon(R) CPU X5570 @ 2.93GHz and can use OpenCL 1.2. The calling code looks like this:

kernel.putArg(aCLBuffer).putArg(bCLBuffer).putArg(elementCount).putNullArg(localWorkSize);
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);

But the error is always the same:

[...]can not enqueue 1DRange CLKernel [...] with gwo: null gws: {512} lws: {128} 
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]

What am I doing wrong?

Michael Dorner
  • I'm not too familiar with the JOCL bindings, but the call: putNullArg(localWorkSize) seems to be allocating bytes of local memory. I think you need to try: putNullArg(localWorkSize * 4), or the java equivalent of putNullArg(localWorkSize * sizeof(float)) – mfa Oct 09 '14 at 12:38
  • unless localWorkSize is already multiplied.. – mfa Oct 09 '14 at 12:43
  • Thank you for your comment, but unfortunately it doesn't solve the problem. The JOCL documentation says nothing about a byte size ( http://goo.gl/ALkBLw ), just `size`, so I assume (and testing confirms) that this is not the problem. – Michael Dorner Oct 09 '14 at 12:46
  • does the error happen for other group sizes? 1, 16, 32 etc? – mfa Oct 09 '14 at 12:54
  • No, for local work size = 1 it does work, but not for > 1. – Michael Dorner Oct 09 '14 at 13:10
  • I run the kernel on the `Intel(R) Xeon(R) CPU X5570 @ 2.93GHz` and can use OpenCL 1.2. – Michael Dorner Oct 09 '14 at 14:06
  • Out of interest, is this an Apple system? I've never seen this on Intel/AMD platforms, so just want to make sure that is still the case. – jprice Oct 09 '14 at 14:22
  • Yes, it is an Apple system, but with Intel Xeon processors. – Michael Dorner Oct 09 '14 at 21:17

1 Answer


This is expected behaviour on some OpenCL platforms. For example, on my Apple system, the CPU device has a maximum work-group size of 1024. However, if a kernel has a barrier inside, then the maximum work-group size for that specific kernel is reduced to 1.

You can query the maximum work-group size for a specific kernel with the clGetKernelWorkGroupInfo function and the CL_KERNEL_WORK_GROUP_SIZE parameter. The value returned will never exceed what clGetDeviceInfo reports for CL_DEVICE_MAX_WORK_GROUP_SIZE, but it is allowed to be smaller (as it is in this case).
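
A minimal sketch of that query in plain C host code (the helper name and the already-built `kernel`/`device` handles are assumptions for illustration, not something taken from the question):

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Query the per-kernel work-group limit before choosing a local size.
   'kernel' and 'device' are an already-built cl_kernel and its cl_device_id. */
size_t maxWorkGroupSizeFor(cl_kernel kernel, cl_device_id device)
{
    size_t kernelWgSize = 0;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                          CL_KERNEL_WORK_GROUP_SIZE,
                                          sizeof(kernelWgSize), &kernelWgSize, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clGetKernelWorkGroupInfo failed: %d\n", err);
        return 0;
    }
    /* On the Apple CPU device described above this drops to 1 once the kernel
       contains a barrier, even though CL_DEVICE_MAX_WORK_GROUP_SIZE is 1024. */
    return kernelWgSize;
}

Whatever value comes back is the upper bound for the local work size you pass when enqueueing that particular kernel; the JOCL bindings presumably expose an equivalent query on their kernel object, but the C names above are the ones the OpenCL specification defines.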

jprice
  • Are you sure this is "expected behaviour"? Do you have a quote from the standard? Unless I'm missing something, such behaviour would render `barrier` completely useless. – user703016 Oct 09 '14 at 14:15
  • @Cicada The OpenCL specification allows specific kernels to have maximum work-group sizes that are smaller than the device's maximum. This happens frequently for other reasons too - if your kernel uses lots of registers on a GPU, the maximum work-group size can be reduced. By 'expected behaviour', I mean that the behaviour is allowed, and some platforms are known to do this. I agree that this is a pain for developers, since it does mean that you can't do any useful synchronisation. – jprice Oct 09 '14 at 14:17
  • @jprice Oh, I see. That's a bummer indeed. +1 – user703016 Oct 09 '14 at 14:19
  • So, if you need synchronization, you can't launch a work group of more than 1 work item, but synchronizing a single work item is functionally equivalent to having no synchronization at all. In other words, that device doesn't support synchronization, is that right? – user703016 Oct 09 '14 at 14:30
  • @Cicada Yes, it basically implies that the implementors couldn't be bothered to implement any synchronisation for that particular platform. This might mean that each work-item is mapped onto a different thread, which would make synchronisation expensive. Other CPU implementations map a work-*group* onto a thread and then serialise work-item execution within that thread, which makes synchronisation much easier and cheaper. – jprice Oct 09 '14 at 14:33
  • That makes sense, thanks :) I always thought the Intel implementation mapped work items to SSE SIMD lanes. – user703016 Oct 09 '14 at 14:34
  • @Cicada Yes, Intel's implementation does indeed do that, with the entire work-group executing within a single thread. I suspect the OP is using Apple's OpenCL implementation, which isn't so intelligent. – jprice Oct 09 '14 at 14:36
  • @jprice I have experimented on multiple platforms myself and I knew that the kernel work-group size is sometimes smaller (registers, etc.), but I had never encountered a case where it would be 1. I think this only applies to CPU OpenCL emulations. As good practice, though, the developer should always read the value provided by the API and calculate the best size taking it into account. It's a good way of adapting kernels to multiple platforms. – DarkZeros Oct 09 '14 at 16:48