I need to run several C++11 threads (GCC 4.7.1) in parallel on the host. Each of them needs to use a device, say a GPU. As per the OpenCL 1.2 spec (p. 357):

All OpenCL API calls are thread-safe except clSetKernelArg. 
clSetKernelArg is safe to call from any host thread, and is safe
to call re-entrantly so long as concurrent calls operate on different
cl_kernel objects. However, the behavior of the cl_kernel object is
undefined if clSetKernelArg is called from multiple host threads on
the same cl_kernel object at the same time.

An elegant way would be to use thread_local cl_kernel objects; the other way I can think of is to use an array of these objects such that the i'th thread uses the i'th object. As I have not implemented either of these before, I was wondering whether any of the two is good, or whether there are better ways of getting things done.
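To illustrate the second idea, here is a minimal sketch of what I have in mind (the kernel name "my_kernel" and the setup objects are made up, and I hand each thread its own kernel explicitly rather than using thread_local, since GCC 4.7.1 does not implement that keyword yet):

    #include <CL/cl.h>
    #include <thread>
    #include <vector>

    // The i'th thread gets the i'th kernel object, so clSetKernelArg
    // never races on a shared cl_kernel.
    void worker(cl_kernel kernel, cl_command_queue queue)
    {
        // ... clSetKernelArg(kernel, ...); clEnqueueNDRangeKernel(queue, kernel, ...);
    }

    void run(cl_program program, cl_command_queue queue, unsigned n)
    {
        std::vector<cl_kernel> kernels(n);
        for (unsigned i = 0; i < n; ++i)
            kernels[i] = clCreateKernel(program, "my_kernel", NULL);

        std::vector<std::thread> threads;
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back(worker, kernels[i], queue);
        for (auto& t : threads)
            t.join();

        for (unsigned i = 0; i < n; ++i)
            clReleaseKernel(kernels[i]);
    }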

A third way perhaps would be to use a mutex for a single cl_kernel object and associate it with an event handler. A thread could then wait until the event has finished. I am not sure whether this works in a multi-threaded situation, though...

– Quiescent

1 Answer


The main question is whether all these threads need to use the same kernel or whether each one gets its own distinct kernel. Your two ideas, thread_local cl_kernel objects or an array of n kernel objects, both result in n kernel objects being created and are equally fine from OpenCL's perspective. If they all contain the same code, though, then you unnecessarily waste space, cause context switches, mess up caching, and so on; this would be comparable to loading an application binary into memory multiple times without sharing the constant binary code segments.

If you actually want to use the same kernel from within multiple threads then I'd suggest performing manual synchronization on a single cl_kernel object. If you don't want your threads to block waiting until other threads have completed their work, you can use asynchronous command queuing and events to get notified once the work of a particular thread is done (to prevent a thread from queuing work faster than the GPU can process it, or to read back results, of course).
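A rough sketch of this (error checking omitted; the function name enqueue_job, the buffer argument and the work size are placeholders):

    #include <CL/cl.h>
    #include <mutex>

    std::mutex kernel_mutex; // guards the shared cl_kernel

    cl_event enqueue_job(cl_kernel kernel, cl_command_queue queue,
                         cl_mem buffer, size_t global_size)
    {
        cl_event done = NULL;
        {
            // clSetKernelArg is the one non-thread-safe call; keep the lock
            // until the kernel with its current arguments has been enqueued.
            std::lock_guard<std::mutex> lock(kernel_mutex);
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                   NULL, 0, NULL, &done);
        }
        // The caller can do other CPU work first and then block with
        // clWaitForEvents(1, &done) (or poll the event) when the result is needed.
        return done;
    }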

If your threads shall execute different kernel programs then I suggest creating a separate command queue per thread to simplify execution. It is totally up to you then whether you store these object handles in thread-local storage, in a global array, or elsewhere.
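For example (again only a sketch; context/device setup and the actual work are left out):

    #include <CL/cl.h>

    void thread_main(cl_context context, cl_device_id device, cl_kernel kernel)
    {
        // Each thread creates and owns its queue, so enqueuing requires no
        // synchronization between the threads.
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
        // ... set arguments on this thread's own kernel and enqueue it ...
        clFinish(queue); // or associate events with the enqueued commands
        clReleaseCommandQueue(queue);
    }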

– Stacker
  • Yes, all the threads need to use the same kernel. I see that using a single kernel with asynchronous command queuing means I would have to do a CPU-intensive active wait or use a C++11 condition_variable. Perhaps condition_variable is the better choice, isn't that so? Thanks for the detailed reply. – Quiescent Oct 08 '12 at 15:44
  • Active waits (aka busy waits) are rarely a good choice. condition_variables (aka signals) always work in combination with a mutex, which is probably already sufficient on its own in this case (a simple critical-section pattern where only the one thread that holds the lock on the mutex may enter while all others sleep until they get the lock). But the exact locking strategy, and whether it is worth using another approach, e.g. two cl command queues, one for kernel execution and one for memory up-/downloads, depends on your specific problem and can't be answered at this point. – Stacker Oct 09 '12 at 00:53
  • Assuming that the kernels take a 'decent' amount of time to complete their work, won't a mutex serialize the code? If so, wouldn't a mechanism through which all the jobs are submitted at once improve performance, since all work-items would be executing the same kernel, compatible with the SIMD nature of the GPU, which requires that all the work-items in a work-group execute the same instruction? – Quiescent Oct 09 '12 at 02:09
  • Executing a kernel already starts multiple "threads" in parallel on the GPU, depending on the size of your local work group. You should always choose the largest work group possible to get as much work done in parallel as possible. This also means that executing multiple kernels in parallel is not beneficial or even possible (because all processors should ideally be assigned to a single kernel). – Stacker Oct 09 '12 at 12:16
  • From another perspective: if these kernels take a decent amount of time and your code depends on getting some result from the executed kernels, then you have to wait until these kernels are finished, no matter what. The only thing you can do is queue work asynchronously (you still have to use a mutex so that only one thread at a time queues work), associate an event with it, and then perform some other work on your CPU threads (if there is any) until the event signals that the work on your GPU is completed. Again, this depends on your exact algorithm and on who depends on whom. – Stacker Oct 09 '12 at 12:20
  • Thanks, Stacker. That was a very useful set of advice. :) – Quiescent Oct 09 '12 at 15:15