3

I think this might be a very stupid question, but I'm very new to OpenCL and just got it running on my desktop computer with a GTX 760 GPU.

Now when I query OpenCL's CL_DEVICE_MAX_COMPUTE_UNITS, it says there are 6 on the GPU. Yet for the on-board GPU (Intel HD Graphics 4600) it says there are 20.

This seems a little disappointing, as I would expect the GTX to have many more than the on-board GPU.

Or does CL_DEVICE_MAX_COMPUTE_UNITS not translate directly to the number of cores?
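For reference, the value in question comes from clGetDeviceInfo. A minimal sketch of the query (error handling and dynamic allocation trimmed; assumes an OpenCL runtime is installed and the program is linked with -lOpenCL) — iterating over platforms matters here, because the NVIDIA and Intel devices usually live under separate platforms:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256];
            cl_uint units = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof units, &units, NULL);
            printf("%s: %u compute units\n", name, units);
        }
    }
    return 0;
}
```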

Vertexwahn
  • 7,709
  • 6
  • 64
  • 90
user1291510
  • 265
  • 5
  • 14
  • 1
  • The GTX 760 reports 6 CL compute units, but I think each of those is actually a virtualization of 192 cores, totaling the advertised 6*192 = 1152 cores. Correct me if this isn't the case. That number alone isn't indicative of the performance of the card, nor of the amount of work you can distribute. There is also a similar question; see if it helps: http://stackoverflow.com/q/5679726 – user2464424 Feb 25 '16 at 16:10
  • That makes sense. Is there a way to query the number of threads? That is, can I somehow dynamically find the number 1152 without knowing the specific card? – user1291510 Feb 25 '16 at 17:55
  • Apparently you can't get that number if what you're looking for is the mere specification detail. Crawl the Wikipedia list if you need that info. CL_DEVICE_MAX_WORK_GROUP_SIZE will tell you the maximum allowed work-item count, but you can't know whether the work you are issuing is executing in parallel or not; you have to "trust" the hardware. In other words, having 1152 threads doesn't mean that each one is executed on its own core. – user2464424 Feb 25 '16 at 18:27

1 Answer

3

You tend to think: "How many cores does my device have? Then I will launch that many threads."

That way of thinking is wrong for cases like OpenCL/CUDA.


A core contains a limited amount of resources: memory and threads. Depending on how much each "thread" uses (and therefore depending on the code/kernel), the core will be able to run a different number of threads concurrently.

So the first unknown is: "How many threads can a core run?" That is unknown until the code is compiled, and different versions of a compiler/driver can lead to different results.

If you don't know the number of threads per core, then what use is knowing "6 × ? = ?"? You still don't know how many threads can run in parallel, and you never will. Of course you can get the maximum value, but it may not always hold in practice, so what use is it for real applications?


You have to think of a GPU as an unknown number of very simple workers that can only be put to the same task in groups of X.

The only important question is: "How many threads are going to work in parallel in the same group?", because you can use clever cooperation techniques so that those threads run faster together. And that is the "work group size".
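As an illustration of that in-group cooperation (a sketch, not from the original answer): a work-group sum reduction in OpenCL C, where the threads of one group share __local memory and synchronize with barriers. It assumes the local size is a power of two and the global size is a multiple of it:

```c
// OpenCL C kernel (illustrative): threads within one work-group
// cooperate through __local memory and barrier(), which is what
// makes the work-group size a real design parameter.
__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch) {
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   // wait for the whole group to load

    // Tree reduction within the work-group.
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; // one partial sum per group
}
```

How many of these groups execute at once on the hardware is exactly the part the answer says you cannot (and should not) design around.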

The other parameters are simply redundant: they will just make your app faster or slower, or allow you to run multiple tasks concurrently, but they should not be a design parameter.

Just as the CPU clock speed or the L1 cache size is not a design parameter in CPU programming. Nor is how many other apps are running.

DarkZeros
  • 8,235
  • 1
  • 26
  • 36