OpenCL abstracts the hardware with device objects. Each device is composed of compute units, and each compute unit contains a certain number of processing elements.
How do these three concepts map to the physical hardware?
Let's take a graphics card as an example.
This is my guess:
device -> graphics card
compute units -> graphics card cores
processing elements -> single lanes of the vector ALUs inside the graphics card cores (stream cores)
What I read in different OpenCL tutorials is that we should divide our problem data into a 1-, 2-, or 3-dimensional index space and then assign a piece of that n-dimensional data to each work-group. A work-group is then executed on the stream cores inside a single compute unit.
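To make sure I'm talking about the same thing, here is a tiny kernel sketch (the kernel name scale and its arguments are made up for illustration) showing how each work-item locates its piece of the n-dimensional data:

    /* Each work-item asks the runtime where it sits in the NDRange. */
    __kernel void scale(__global float *data, float factor)
    {
        size_t gid   = get_global_id(0);   /* position in the whole 1D problem  */
        size_t group = get_group_id(0);    /* which work-group this item is in  */
        size_t lid   = get_local_id(0);    /* position inside that work-group   */

        /* Only gid is needed here; group and lid just show the hierarchy. */
        (void)group; (void)lid;
        data[gid] = data[gid] * factor;
    }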
If my graphics card has 4 compute units, does this mean that I can have at most 4 work-groups? Each compute unit in my graphics card has 48 stream cores. Again, does this mean that I should create work-groups with at most 48 work-items? Or multiples of 48?
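For reference, this is roughly how I query those numbers on my machine (a minimal sketch, error checking omitted, and I'm assuming the first GPU on the first platform is the card in question). As far as I understand, CL_DEVICE_MAX_WORK_GROUP_SIZE reports the maximum number of work-items allowed in one work-group, which is not necessarily the same as the number of stream cores:

    #define CL_TARGET_OPENCL_VERSION 120
    #include <stdio.h>
    #include <CL/cl.h>               /* <OpenCL/opencl.h> on macOS */

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint compute_units;
        size_t max_wg_size;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Number of compute units the runtime exposes for this device */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        /* Upper limit on work-items per work-group */
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units:       %u\n", compute_units);
        printf("max work-group size: %zu\n", max_wg_size);
        return 0;
    }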
I guess that OpenCL has some kind of scheduler that lets us use many more work-groups and work-items than there are hardware resources, but I think that the real parallelism is achieved as I stated above.
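This is the kind of launch I have in mind (again a sketch with error checking omitted and the kernel made up): the global size is far larger than 4 compute units * 48 stream cores, so the runtime would have to schedule the work-groups over time.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>               /* <OpenCL/opencl.h> on macOS */

    static const char *src =
        "__kernel void scale(__global float *data, float factor)\n"
        "{\n"
        "    size_t gid = get_global_id(0);\n"
        "    data[gid] = data[gid] * factor;\n"
        "}\n";

    int main(void)
    {
        enum { N = 48 * 4096 };      /* multiple of 48 so the NDRange divides evenly */
        size_t global = N;           /* total work-items */
        size_t local  = 48;          /* my guessed work-group size (must not exceed the device/kernel limit) */
        float *data = malloc(N * sizeof *data);
        float factor = 2.0f;
        cl_int err;

        for (size_t i = 0; i < N; ++i)
            data[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "scale", &err);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    N * sizeof(float), data, &err);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);

        /* 196608 work-items grouped into 4096 work-groups of 48; the driver
           decides when each work-group runs on one of the compute units. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float), data, 0, NULL, NULL);

        printf("data[100] = %f\n", data[100]);   /* expect 200.0 */

        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        free(data);
        return 0;
    }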
Have I got the OpenCL paradigm right?