
I have a kernel running in OpenCL (via a JOCL front end) that is running horribly slowly compared to my other kernels. I'm trying to figure out why, and how to accelerate it. The kernel is very basic: its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.

The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5, it will skip one point, then two, then one, and so on, so that on average 1.5 points are skipped between kept points. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the GPU, so there is no expense to transfer data to or from the CPU.
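To make the skip pattern concrete, here is a minimal CPU reference sketch in plain Java. It is not the original kernel, which is not shown. It assumes that a skip value of s means the kept points are spaced s + 1 samples apart on average, so output element i reads input index floor(i * (s + 1)); that mapping reproduces the "skip one, then two, then one" pattern for 1.5. The class and method names are hypothetical.

```java
// CPU reference for the decimation described above (a sketch, not the original kernel).
// Assumption: "skip" counts the samples dropped between kept samples, so the
// average spacing between kept samples is skip + 1.
final class DecimationSketch {

    // Number of kept samples for a given input length and skip value.
    static int outputLength(int inputLength, double skip) {
        return (int) Math.ceil(inputLength / (skip + 1.0));
    }

    static float[] decimate(float[] input, double skip) {
        if (skip < 0.0) {
            throw new IllegalArgumentException("skip must be >= 0");
        }
        double step = skip + 1.0;
        float[] out = new float[outputLength(input.length, skip)];
        // Each loop iteration corresponds to one work-item (global id) in a kernel.
        for (int gid = 0; gid < out.length; gid++) {
            int src = (int) Math.floor(gid * step);
            // Clamp to guard against floating-point rounding at the last element.
            out[gid] = input[Math.min(src, input.length - 1)];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] input = new float[13];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
        // With skip = 1.5 the kept indices are 0, 2, 5, 7, 10, 12.
        System.out.println(java.util.Arrays.toString(decimate(input, 1.5)));
    }
}
```

Each output element depends only on its own index, so the same index arithmetic would map one-to-one onto work-items in a kernel, with reads strided by skip + 1 and writes contiguous.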

This kernel is running 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast kernels. I realize that I'm suffering a penalty for not coalescing my array accesses, but I can't believe that it would cause me to run this horribly slowly. After all, every other kernel is touching every sample in the array; I would think touching every Xth sample in the array, even if not coalesced, should be at least around the same speed as touching every sample.

The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate the real data and one to decimate the imaginary data, but this didn't help at all. Likewise, I tried 'unrolling' the kernel by having one thread be responsible for decimating 3-4 points, but this didn't help either. I've tried varying the size of the data passed into each kernel call (i.e. one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points), which has allowed me to eke out small performance gains, but not the order of magnitude I need for this kernel to be considered worth implementing on the GPU.

Just to give a sense of scale, this kernel is taking 98 ms to run per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 ms or less. What else could cause such a simple kernel to run so absurdly slowly compared to the rest of the kernels we're running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU? I don't need this kernel to run faster than the CPU; it just can't be so slow compared to the CPU that I can't keep all processing on the GPU.

dsollen
  • What hardware is this running on? – talonmies Aug 29 '11 at 15:39
  • What happens if you, instead of skipping the samples, read them all? You can still drop the ones you don't want to keep afterwards. But the kernel should be at least as fast as the other ones then. – w-m Aug 29 '11 at 17:29
  • I'm running on an old NVIDIA card with 256 MB. I unfortunately don't know if I'm allowed to post the code due to excessive restrictions my company has about sharing code. – dsollen Aug 29 '11 at 20:33
  • I have thought about doing as W.M said. That was to be my next attempt before I got dragged away from work for 6 hours heh. I'll try it now and see what happens. – dsollen Aug 29 '11 at 20:34
  • Trying W.M's idea doesn't work; it slowed the process down noticeably. However, I've noticed that when I time things on the CPU, I only see a delay if I put my timer after the release call on the buffer that was decimated. Why would a release cause the program to stall? I thought release only decremented a counter, not actually performed the release or waited for it to happen. – dsollen Aug 29 '11 at 22:50

1 Answer


It turns out the issue isn't with the kernel at all. Instead, the problem is that when I try to release the buffer I was decimating, the entire program stalls while the kernel (and all other kernels in the queue) complete. This appears to be incorrect behavior: as far as I understand, the clRelease call should only decrement a reference counter, not block on the queue. However, the important point is that my kernel is running as efficiently as it should be.
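The measurement pitfall can be illustrated with a plain-Java analogy. This is only a sketch: it makes no OpenCL calls, a single-thread executor stands in for the command queue, and a sleeping task stands in for the kernel (all names are hypothetical). Submitting work returns almost immediately, and the elapsed time is charged to whichever later call happens to block.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogy of an asynchronous command queue; no OpenCL is involved.
final class AsyncTimingSketch {

    // Returns {enqueueMillis, blockingWaitMillis} for one simulated "kernel".
    static long[] measure(long kernelMillis) {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            long t0 = System.nanoTime();
            // "Enqueue": returns as soon as the work has been handed to the queue.
            Future<?> kernel = queue.submit(() -> {
                try {
                    Thread.sleep(kernelMillis); // stand-in for the kernel's run time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            long t1 = System.nanoTime();
            // First blocking call after the enqueue: this is where the time shows up.
            kernel.get();
            long t2 = System.nanoTime();
            return new long[] { (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000 };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] t = measure(200);
        System.out.println("enqueue: " + t[0] + " ms, blocking wait: " + t[1] + " ms");
    }
}
```

In OpenCL terms, the equivalent precaution would be to call clFinish on the command queue before stopping a host-side timer, so that each kernel's time is attributed to that kernel rather than to whichever later call happens to block.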

dsollen