OpenCL abstracts the hardware with device objects. Each device is composed of compute units, and each compute unit contains a certain number of processing elements.
How do these three concepts map to the physical hardware?
Let's take a graphics card as an example.
This is my guess:
device -> graphics card
compute units -> graphics card cores
processing elements -> single lanes of the vector ALUs inside the graphics card cores (stream cores)
What I read in different OpenCL tutorials is that we should divide our problem data into a 1-, 2-, or 3-dimensional index space and then assign a piece of that n-dimensional data to each work-group. A work-group is then executed on the stream cores inside a single compute unit.
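To make sure I'm talking about the same thing, here is a tiny kernel sketch (the kernel name scale and its arguments are made up for illustration) showing how each work-item locates its piece of the n-dimensional data:

    /* Each work-item asks the runtime where it sits in the NDRange. */
    __kernel void scale(__global float *data, float factor)
    {
        size_t gid   = get_global_id(0);   /* position in the whole 1D problem  */
        size_t group = get_group_id(0);    /* which work-group this item is in  */
        size_t lid   = get_local_id(0);    /* position inside that work-group   */

        /* Only gid is needed here; group and lid just show the hierarchy. */
        (void)group; (void)lid;
        data[gid] = data[gid] * factor;
    }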
If my graphics card has 4 compute units, does this mean that I can have at most 4 work-groups? Each compute unit in my graphics card has 48 stream cores. Again, does this mean that I should create work-groups with at most 48 work-items? Or multiples of 48?
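For reference, this is roughly how I query those numbers on my machine (a minimal sketch, error checking omitted, and I'm assuming the first GPU on the first platform is the card in question). As far as I understand, CL_DEVICE_MAX_WORK_GROUP_SIZE reports the maximum number of work-items allowed in one work-group, which is not necessarily the same as the number of stream cores:

    #define CL_TARGET_OPENCL_VERSION 120
    #include <stdio.h>
    #include <CL/cl.h>               /* <OpenCL/opencl.h> on macOS */

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint compute_units;
        size_t max_wg_size;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Number of compute units the runtime exposes for this device */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        /* Upper limit on work-items per work-group */
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units:       %u\n", compute_units);
        printf("max work-group size: %zu\n", max_wg_size);
        return 0;
    }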
I guess that OpenCL has some kind of scheduler that lets us use many more work-groups and work-items than there are hardware resources, but I think that the real parallelism is achieved as I stated above.
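This is the kind of launch I have in mind (again a sketch with error checking omitted and the kernel made up): the global size is far larger than 4 compute units * 48 stream cores, so the runtime would have to schedule the work-groups over time.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>               /* <OpenCL/opencl.h> on macOS */

    static const char *src =
        "__kernel void scale(__global float *data, float factor)\n"
        "{\n"
        "    size_t gid = get_global_id(0);\n"
        "    data[gid] = data[gid] * factor;\n"
        "}\n";

    int main(void)
    {
        enum { N = 48 * 4096 };      /* multiple of 48 so the NDRange divides evenly */
        size_t global = N;           /* total work-items */
        size_t local  = 48;          /* my guessed work-group size (must not exceed the device/kernel limit) */
        float *data = malloc(N * sizeof *data);
        float factor = 2.0f;
        cl_int err;

        for (size_t i = 0; i < N; ++i)
            data[i] = (float)i;

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "scale", &err);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    N * sizeof(float), data, &err);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);

        /* 196608 work-items grouped into 4096 work-groups of 48; the driver
           decides when each work-group runs on one of the compute units. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float), data, 0, NULL, NULL);

        printf("data[100] = %f\n", data[100]);   /* expect 200.0 */

        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        free(data);
        return 0;
    }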
Have I got the OpenCL paradigm right?