6

I'm working on an algorithm that does pretty much the same operation a bunch of times. Since the operation consists of some linear algebra (BLAS), I thought I would try using the GPU for this.

I've written my kernel and started pushing kernels onto the command queue. Since I don't want to wait after each call, I figured I would try daisy-chaining my calls with events and just start pushing these onto the queue.

call kernel1(return event1)
call kernel2(wait for event 1, return event 2)
...
call kernel1000000(wait for event 999999)
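In OpenCL host code, the chaining sketched above would look roughly like this (a minimal sketch; `queue`, `kernel`, and `global_size` are assumed to have been created and set up elsewhere):

```c
/* Sketch: daisy-chaining kernel launches with events.
 * Assumes `queue` (cl_command_queue), `kernel` (cl_kernel), and
 * `global_size` (size_t) were initialized elsewhere. */
cl_event prev = NULL;
for (int i = 0; i < 1000000; i++) {
    cl_event next;
    /* Wait on the previous launch (if any); signal `next` on completion. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           prev ? 1 : 0,        /* events in wait list */
                           prev ? &prev : NULL, /* wait list */
                           &next);
    if (prev) clReleaseEvent(prev); /* don't leak event objects */
    prev = next;
}
if (prev) clReleaseEvent(prev);
clFinish(queue); /* block until everything has executed */
```

Note that without the `clReleaseEvent` calls, a chain of a million enqueues would also accumulate a million live event objects.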

Now my question is, does all of this get pushed to the graphics chip, or does the driver store the queue? Is there a bound on the number of events I can use, or on the length of the command queue? I've looked around but I've not been able to find this.

I'm using atMonitor to check the utilization of my GPU, and it's pretty hard to push it above 20%. Could this simply be because I'm not able to push the calls out there fast enough? My data is already stored on the GPU, and all I'm passing out there is the actual calls.

Martin Kristiansen
  • 9,875
  • 10
  • 51
  • 83

2 Answers

5

First, you shouldn't wait for an event from a previous kernel unless the next kernel has data dependencies on that previous kernel. Device utilization (normally) depends on there always being something ready-to-go in the queue. Only wait for an event when you need to wait for an event.

"does all of this get pushed to the graphics chip, or does the driver store the queue?"

That's implementation-defined. Remember, OpenCL works on more than just GPUs! In terms of the CUDA-style device/host dichotomy, you should probably think of command queue operations (for most implementations) as living on the "host" side.

Try queuing up multiple kernel calls without waits in between them. Also, make sure you are using an optimal work-group size. If you do both of those, you should be able to max out your device.
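For independent iterations, that just means dropping the wait list entirely. A minimal sketch, again assuming `queue`, `kernel`, and `global_size` are set up elsewhere:

```c
/* Sketch: enqueue independent kernel launches with no event dependencies.
 * The driver/device is then free to keep the hardware busy back-to-back. */
for (int i = 0; i < 1000000; i++) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL); /* no wait list, no event out */
}
clFinish(queue); /* one synchronization point at the very end */
```

Compared with the event-chained version, this serializes nothing on the host side: there is a single `clFinish` at the end rather than a dependency between every pair of launches.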

Ryan Marcus
  • 966
  • 8
  • 21
1

Unfortunately I don't know the answers to all of your questions, and you've got me wondering about the same things now too, but I can say that I doubt the OpenCL queue will ever become full, since your GPU should finish executing the last queued command before at least 20 commands are submitted. This is only true, though, if your GPU has a "watchdog", because that would stop ridiculously long kernels (I think 5 seconds or more) from executing.

A Person
  • 801
  • 1
  • 10
  • 22
  • OK, can you tell me where you know this from? I'm trying to figure out the actual specs of OpenCL, but it's not easy at all. (I'm actually considering switching to CUDA.) Are you saying that the driver bundles up the commands and sends them to the GPU in large chunks? – Martin Kristiansen Aug 12 '11 at 09:42
  • 1
    I think the driver does bundle the commands, since according to the OpenCL documentation clFinish blocks until all commands in the passed-in command queue have finished executing; so unless you call clFinish, OpenCL is going to decide when commands are executed. However, calls to clFinish are expensive and should be avoided; I would still give it a try though. Have you considered that your GPU is possibly fast enough to execute your computation without ever needing 100% of its power? The only other thing I can think of is that OpenCL limits GPU usage so that your computer's display won't lock up – A Person Aug 13 '11 at 01:12