I am trying to write MergeSort in OpenCL (I know BitonicSort is faster, but I want to compare them), and I have come across a strange problem:
If I set the global size to `1 << 24` and the local size to `512`, the kernel is simply not executed, and neither are the kernels enqueued after it. However, I don't get any error, either when enqueuing the kernel or when waiting for the queue to finish. Nothing; the kernel just does not run. ComputeProfiler shows the same thing: no kernel. With global size `1 << 23`, however, the algorithm works well. With local size `256`, the minimum failing global size is `1 << 23`.
Why does that happen? I thought there could be at least `65535` work-groups (according to the NVidia Programming Guide). Rounded down to the nearest power of two, that is `32768 == 1 << 15`; with local size `512 == 1 << 9`, this means that a global size of `1 << 24` should still be OK. Moreover, I can execute another kernel with this global and local size.
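To make the arithmetic above concrete, here is a small sketch in plain C (no OpenCL calls). The helper name `workgroup_count` is hypothetical, and it assumes the global size is an exact multiple of the local size, as in the sizes I tried. It does not explain the failure, but it shows that both failing configurations reach 32768 work-groups, which is still below 65535, while the working ones stay at 16384.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups created by a 1D launch.
 * Assumes global_size is an exact multiple of local_size. */
size_t workgroup_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, `workgroup_count((size_t)1 << 24, 512)` and `workgroup_count((size_t)1 << 23, 256)` both give 32768, whereas `workgroup_count((size_t)1 << 23, 512)` gives 16384.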
And most of all, there's no error, so I cannot detect that this has happened. I'll probably have to write a workaround (looping manually in the work-groups over the large set), but I want to understand the problem.
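The workaround I have in mind would look roughly like the following. This is only a host-side simulation in plain C, not the real kernel: the function name `strided_cover` and the `visited` array are hypothetical, the outer loop stands in for the work-items of a smaller launch, and I am assuming each element can be processed independently, which may not hold for every merge step.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates a launch of launch_size work-items covering total elements.
 * Each simulated work-item starts at its own id and advances by
 * launch_size, so every index in [0, total) is visited exactly once. */
void strided_cover(size_t total, size_t launch_size, unsigned char *visited)
{
    assert(launch_size > 0 && visited != NULL);
    /* gid stands in for get_global_id(0) in the real kernel. */
    for (size_t gid = 0; gid < launch_size; gid++) {
        /* The manual loop that would live inside the kernel. */
        for (size_t i = gid; i < total; i += launch_size) {
            visited[i]++;
        }
    }
}
```

With this scheme the global size could stay at `1 << 23` (which works for me with local size `512`) while `total` is `1 << 24`, so each work-item would handle two elements.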
Thanks for any suggestions.
PS: I use NVidia GTX 580 on a Linux machine with drivers 260.19.26.