
I am trying to write MergeSort in OpenCL (I know, BitonicSort is faster, but I want to compare them), and I have come across a strange problem:

If I set the global size to 1 << 24 and the local size to 512, the kernel is simply not executed, and neither are the kernels enqueued after it. However, I don't get any error, either when enqueuing the kernel or when waiting for the queue to finish. Nothing; the kernel just does not run. ComputeProfiler confirms it: no kernel appears. With global size 1 << 23, however, the algorithm works well. With local size 256, the minimum failing global size is 1 << 23.

Why does that happen? I thought there could be at least 65535 work-groups (according to the NVIDIA Programming Guide); rounded down to the nearest power of two, that is 32768 == 1 << 15. With local size 512 == 1 << 9, a global size of 1 << 24 should therefore still be OK. Moreover, I can execute another kernel with this global and local size.
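To make the arithmetic above concrete, here is a minimal sketch in plain C (no OpenCL required) that computes the work-group count for the sizes mentioned. The helper names `work_group_count` and `within_limit` are hypothetical, and the 65535 figure is only the limit assumed in the question, not a value queried from the device.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D NDRange.
   Assumes local_size is non-zero and divides global_size evenly. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size != 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical check against an assumed per-dimension work-group limit
   (65535 in the question; not queried from any device here). */
int within_limit(size_t groups, size_t limit)
{
    return groups <= limit;
}
```

Calling `work_group_count((size_t)1 << 24, 512)` and `work_group_count((size_t)1 << 23, 256)` both give 32768 work-groups, while the working case `work_group_count((size_t)1 << 23, 512)` gives 16384. So both failing configurations hit the same group count, which is still below the assumed 65535. That is only an observation about the numbers, not a diagnosis of the cause.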

Worst of all, there's no error, so I cannot detect that this has happened. I'll probably have to write a workaround (looping manually over the large set inside the work-groups), but I want to understand the problem.
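The workaround mentioned above could follow a strided-loop pattern: launch fewer work-items than there are elements and let each one step through the data in strides of the global size. Below is a rough CPU emulation in plain C, only to show the indexing. The names `work_item` and `emulate_launch` and the placeholder per-element work are hypothetical; in a real kernel `gid` and `global_size` would come from `get_global_id(0)` and `get_global_size(0)`. It does not model the merge step, local memory, or barriers, which would need separate care.

```c
#include <stddef.h>

/* Body of one emulated work-item: visits gid, gid + global_size, ...
   for as long as the index stays below n. */
void work_item(size_t gid, size_t global_size, float *data, size_t n)
{
    for (size_t i = gid; i < n; i += global_size)
        data[i] += 1.0f; /* placeholder per-element work */
}

/* CPU stand-in for a kernel launch: runs every work-item sequentially. */
void emulate_launch(size_t global_size, float *data, size_t n)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        work_item(gid, global_size, data, n);
}
```

With this pattern, `emulate_launch(4, data, 10)` touches each of the 10 elements exactly once even though only 4 work-items exist, so the global size (and hence the work-group count) can stay below whatever value triggers the failure.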

Thanks for any suggestions

PS: I use an NVIDIA GTX 580 on a Linux machine with driver version 260.19.26.

Radim Vansa
  • Could you please post your kernel code? – aland Sep 14 '11 at 16:35
  • @aland: Well, I should back up the code when I ask a question, shouldn't I? I have already rewritten it, sorry :-/ I remember from the profiler that it used 23 registers, the kernel had 6 arguments (4 memory buffers, 2 integers), and it used a small local memory buffer (2 * sizeof(float) * workgroup size)... – Radim Vansa Sep 14 '11 at 17:37
  • It just looks very much like a compiler bug, so having the code might help in checking its reproducibility. Well, if your rewritten code does the same and works, then so be it :) – aland Sep 14 '11 at 17:48
  • *** compiler... I wonder what allocation strategy it uses. I've now found that if I put my code into a simple loop, the number of used registers grows from 25 to 47. – Radim Vansa Sep 14 '11 at 19:07

0 Answers