
I am trying to run my program with OpenCL.

I have seen the following information in the log:

OpenCL device #0: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #1: GPU NVIDIA Corporation GeForce GT 730 with OpenCL 1.2 (2 units, 901 MHz, 4096 Mb, version 391.35)
OpenCL device #2: CPU Intel(R) Corporation Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz with OpenCL 2.1 (8 units, 4000 MHz, 16300 Mb, version 7.0.0.2567)

What I gather from the above information is that each of my GPU devices has 2 compute units.

After checking the specification of my GPU with the CUDA-Z utility, I see that 384 cores are reported for the GPU device at [ PCI_LOC=0:1:0 ].

See the image: my GPU specification (screenshot from CUDA-Z)

clinfo shows the following: gist of clinfo

My question is: when I have 384 cores each, why are only 2 units displayed? Secondly, when I have many cores, how does OpenCL distribute the task? Is it the same process and the same data on each core, or different cores with different data?

  • Directly before the number 384 you see the number 2. As this is the only "2" in the overview, you can easily guess what "unit" refers to for your GPU. Anyway... afaik OpenCL uses the terms "compute elements", "... units" and "... devices". The corresponding Nvidia terms are "streaming processor", "streaming multiprocessor" and "GPU", respectively. – BlameTheBits May 17 '18 at 07:59
  • @Shadow Dear friend, I have got totally confused by the terminology. May I know what exactly it means? Does it mean that I will be able to compute with only 2 cores, or does it mean something else? – Jaffer Wilson May 17 '18 at 08:08
  • Have you used OpenCL and/or Nvidia GPUs before? If not, for the Nvidia part you may read chapters 1, 2 and 4 of the [CUDA programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) (yes, it is about CUDA, but I think these chapters contain basic knowledge even if you do not want to use CUDA). There are good questions here on SO too. Regarding the OpenCL part, and especially the link to Nvidia GPUs, I cannot help you much as I am not really familiar with these topics. Maybe [this](http://sa09.idav.ucdavis.edu/docs/SA09_NVIDIA_IHV_talk.pdf)... (on page 4 is what I wrote earlier) – BlameTheBits May 17 '18 at 08:39
  • @Shadow Thank you for the help. I guess I have read them previously, but I will read them again. – Jaffer Wilson May 17 '18 at 08:59
  • Possible duplicate of [What is the relationship between NVIDIA GPUs' CUDA cores and OpenCL computing units?](https://stackoverflow.com/questions/34259338/what-is-the-relationship-between-nvidia-gpus-cuda-cores-and-opencl-computing-un) – BlameTheBits May 17 '18 at 09:45
  • You are welcome. I just found a very similar question to yours and thus marked it as a duplicate. There is also an OpenCL programming guide linked (in the second answer, and also mentioned in the first answer of a question linked from the first answer). [Version 4.2](http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Programming_Guide.pdf) is the newest I can find right now. – BlameTheBits May 17 '18 at 09:49

1 Answer


My question is: when I have 384 cores each, why are there only 2 units displayed?

Easy:
GPU computing devices are different; their silicon-hardwired architecture is unlike that of any universal CISC/RISC CPU computing device.

The reason WHY is very important here.

GPU devices use Streaming Multiprocessor eXecution units ( SMX units ), which are what the "compute units" reported by some hardware-inspection tools refer to.

The letter M in the SMX abbreviation emphasises that multiple execution flows can be loaded onto an SMX-unit, yet all such cases actually execute ( sure, only if instructed in such a manner, which goes beyond the scope of this topic, to cover / span all of the SM-cores present on the SMX ) the very same computing instruction. This is the only way they can operate; it is called a SIMD-type of parallelism with a limited scope, achievable ( co-locally ) only on the perimeter of the SMX, where single-instruction-multiple-data gets executed within the capabilities of the present SIMD-( WARP-wide | half-WARP-wide )-schedulers.

The 384 cores listed above mean a hardware limit, beyond which this co-locally orchestrated SIMD-type of limited-scope parallelism cannot grow, and all attempts in that direction lead to a purely [SERIAL] internal scheduling of GPU-jobs ( yes, i.e. one after another ).
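To see where the reported "2 units" come from on the OpenCL side, here is a minimal host-side sketch ( not the poster's code; the variable names are illustrative only ) that queries the very same figures the log above has printed:

/* a minimal sketch, not the poster's code: query what OpenCL reports
   as "compute units" -- on NVIDIA silicon this maps to the number of
   SMX-units, not to the number of SM-cores ( CUDA-cores )             */
#include <stdio.h>
#include <CL/cl.h>

int main( void )
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        nComputeUnits;
    cl_uint        clockMHz;

    clGetPlatformIDs( 1, &platform, NULL );
    clGetDeviceIDs(   platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    clGetDeviceInfo(  device, CL_DEVICE_MAX_COMPUTE_UNITS,            /* the "2 units"  */
                      sizeof( nComputeUnits ), &nComputeUnits, NULL );
    clGetDeviceInfo(  device, CL_DEVICE_MAX_CLOCK_FREQUENCY,          /* the "901 MHz"  */
                      sizeof( clockMHz ),      &clockMHz,      NULL );

    printf( "Compute units ( SMX ): %u\n",     nComputeUnits );
    printf( "Max clock frequency  : %u MHz\n", clockMHz );
    return 0;
}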

Understanding these basics is cardinal, as without these architecture features one may expect behaviour that is actually, principally impossible to orchestrate in any kind of GPGPU system having the formal shape of a [ 1-CPU-host : N-GPU-device(s) ] composition of autonomous, asynchronous star-of-nodes.

Any GPU-kernel loaded from the CPU-host onto the GPU gets mapped onto a non-empty set of SMX-unit(s), where a specified number of cores ( another, finer-grained geometry-of-computing-resources is applied here, again going way beyond the scope of this post ) gets loaded with a stream of SIMD-instructions, while not violating the GPU-device limits ( a launch sketch follows the listing ):

 ...
+----------------------------------------------------------------------------------------
 Max work items dimensions:          3       // 3D-geometry grids possible
    Max work items[0]:               1024    // 1st dimension max.
    Max work items[1]:               1024
    Max work items[2]:               64      // theoretical max. 1024 x 1024 x 64 BUT...
+----------------------------------------------------------------------------------------
 Max work group size:                1024    // actual      max. "geometry"-size
+----------------------------------------------------------------------------------------
 ...
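A minimal launch sketch, assuming a command queue and an already-built kernel exist ( the function name launch_1d and the chosen work-group size of 256 are illustrative only ), showing how a task-"geometry" is expressed against the limits listed above:

#include <CL/cl.h>

/* a sketch only: enqueue an already-built 1D kernel with an explicit
   task-"geometry" that respects the Max work group size limit above   */
cl_int launch_1d( cl_command_queue queue, cl_kernel kernel, size_t n )
{
    const size_t localSize  = 256;                          /* <= 1024 ( Max work group size ) */
          size_t globalSize = ( ( n + localSize - 1 )       /* round the global size up to a   */
                                / localSize ) * localSize;  /* multiple of the work-group size */

    /* each work-group gets mapped onto one SMX-unit; inside it the
       work-items are issued in SIMD-fashion ( WARPs of 32 on NVIDIA )  */
    return clEnqueueNDRangeKernel( queue, kernel,
                                   1,             /* 1D grid geometry   */
                                   NULL,          /* no global offset   */
                                   &globalSize,
                                   &localSize,
                                   0, NULL, NULL );
}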

So,

  • if 1 SM-core is instructed to execute some GPU-task unit ( a GPU-job ), just this one SM-core will fetch one GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. All the other SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job gets finished and the internal GPU-process-management system decides about mapping some other work onto this SMX.

  • if 2 SM-cores are instructed to execute some GPU-job, just this pair of SM-cores will fetch one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and both execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. In this case, if one SM-core gets into a condition where an if-ed, or similarly branched, flow of execution makes it go into another code-execution path than the other, the SIMD-parallelism gets into a divergent scenario ( see the kernel sketch below ): one SM-core gets the next SIMD-instruction belonging to its code-execution path, whereas the other one does nothing ( gets GPU_NOP(s) ), until the first one finishes the whole job ( or is forced to stop at some synchronisation barrier, or falls into an unmaskable latency wait-state while waiting for a piece of data to get fetched from a "far" ( slow ) non-local memory location; again, details go way beyond the scope of this post ). Only after one of these happens can the divergent-path, so far just GPU_NOP-ed SM-core receive the next SIMD-instruction belonging to its ( divergent ) code-execution path and move forward. All the other SM-cores present on the same SMX-unit typically do nothing during that time, until this GPU-job gets finished and the internal GPU-process-management system decides about mapping some other work onto this SMX.

  • if 16 SM-cores are instructed to execute some GPU-job by the task-specific "geometry", just this "herd" of SM-cores will fetch one ( and the very same ) GPU-RISC-instruction after another ( ignoring any possible ILP for simplicity here ) and all execute it one at a time, stepping through the stream of SIMD-instructions of the said GPU-job. Any divergence inside the "herd" reduces the SIMD-effect, and GPU_NOP-blocked cores remain waiting for the main part of the "herd" to finish the job ( same as sketched right above this point ).

Anyway, all the other SM-cores, not mapped by the task-specific "geometry" onto the respective GPU-device's SMX-unit, will typically remain doing nothing useful at all, so knowing the hardware details when choosing the task-specific "geometry" is indeed important, and profiling may help to identify the peak performance for any such GPU-task constellation ( the differences may range over several orders of magnitude, from best to common to worst, among all possible task-specific "geometry" setups ).
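The divergence case from the second bullet, as a minimal OpenCL-C sketch ( a hypothetical kernel, not taken from the poster's code ):

// a sketch only: two divergent code-execution paths inside one work-group.
// Work-items taking the if-path and the else-path sit in the same WARP,
// so the hardware serialises the two paths -- while one path executes,
// the SM-cores on the other path just receive GPU_NOP(s)
__kernel void divergent_paths( __global const float *in,
                               __global       float *out )
{
    const size_t gid = get_global_id( 0 );

    if ( ( gid & 1 ) == 0 )            // even work-items: path A
        out[gid] = in[gid] * 2.0f;
    else                               // odd  work-items: path B, issued
        out[gid] = in[gid] + 1.0f;     // only after path A has retired
}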


Secondly, when I have many cores, how does OpenCL distribute the task? Is it the same process and the same data on each core, or different cores with different data?

As explained briefly above, the SIMD-type device silicon-architecture does not permit any of the SMX SM-cores to execute anything other than the very same SIMD-instruction across the whole "herd"-of-SM-cores that was mapped by a task-"geometry" onto the SMX-unit ( not counting the GPU_NOP(s) as doing "something else", as that is just wasted CPU:GPU-system time ).

So, yes, "... on each core same process ..." ( best if never divergent in its internal code-execution paths after an if or while or any other kind of code-execution branching ). If an algorithm, based on data-driven values, results in a different internal state, each core may hold a different thread-local state, based on which the processing may differ ( as exemplified by the if-driven divergent code-execution paths above ). More details on SM-local registers, SM-local caching, restricted shared-memory usage ( and its latency costs ), and GPU-device global-memory usage ( and its latency costs, cache-line lengths and associativity for best-coalescing access-patterns and latency-masking options ) involve many hardware-related + programming-eco-system details, span small thousands of pages of hardware- and software-specific documentation, and are well beyond the scope of this post, simplified here for clarity.
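So the distribution is data-parallel: the same instruction stream on every core, with each work-item selecting its own data by its global id. A minimal illustrative kernel ( hypothetical, not the poster's code ):

// a sketch only: every work-item executes the very same instruction
// stream; only the data element it touches differs, selected by its
// own global id -- the classic SIMD / data-parallel case
__kernel void vector_add( __global const float *a,
                          __global const float *b,
                          __global       float *c )
{
    const size_t gid = get_global_id( 0 );   // unique per work-item
    c[gid] = a[gid] + b[gid];                // same op, different element
}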

same data, or different cores with different data?

This is the last, but not least, dilemma: any well-parametrised GPU-kernel activation may also pass some amount of external-world data down to the GPU-kernel, which may make the SMX thread-local data different from SM-core to SM-core. Mapping practices and best performance for doing this are principally device-specific, because the { SMX | SM-registers | GPU_GDDR gloMEM : shaMEM : constMEM | GPU SMX-local cache-hierarchy }-details and capacities

  ...
 +---------------------------------------------------------
  ...                                               901 MHz
  Cache type:                            Read/Write
  Cache line size:                     128
  Cache size:                        32768
  Global memory size:           4294967296
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory size:                 49152
 +---------------------------------------------------------
  ...                                              4000 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:            536838144
  Constant buffer size:             131072
  Max number of constant args:         480
  Local memory size:                 32768
 +---------------------------------------------------------
  ...                                              1300 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:           1561123226
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory size:                 65536
 +---------------------------------------------------------
  ...                                              4000 MHz
  Cache type:                            Read/Write
  Cache line size:                      64
  Cache size:                       262144
  Global memory size:           2147352576
  Constant buffer size:             131072
  Max number of constant args:         480
  Local memory size:                 32768

are so different from device to device that each high-performance code project can principally do nothing but profile its respective GPU-device task-"geometry" and resources-usage-map composition on the actual deployment device ( a query sketch follows below ). What works faster on one GPU-device / GPU-driver stack need not work as smartly on another one ( or after a GPU-driver + exo-programming eco-system update / upgrade ); simply, only a real-life benchmark will tell ( as theory can easily be printed, but hardly as easily executed, since many device-specific and workload-injected limitations apply in a real-life deployment ).
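A minimal sketch for reading those per-device capacities programmatically ( the function name print_device_limits is illustrative only; the queried OpenCL parameters are the ones behind the listing above ):

#include <stdio.h>
#include <CL/cl.h>

/* a sketch only: read the device-specific capacities shown above, so the
   task-"geometry" / resources-usage map can be adapted per device        */
void print_device_limits( cl_device_id device )
{
    cl_ulong localMem, constBuf, cacheSize, globalMem;
    cl_uint  cacheLine, constArgs;

    clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE,
                     sizeof( cacheLine ), &cacheLine, NULL );
    clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                     sizeof( cacheSize ), &cacheSize, NULL );
    clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_SIZE,
                     sizeof( globalMem ), &globalMem, NULL );
    clGetDeviceInfo( device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                     sizeof( constBuf ),  &constBuf,  NULL );
    clGetDeviceInfo( device, CL_DEVICE_MAX_CONSTANT_ARGS,
                     sizeof( constArgs ), &constArgs, NULL );
    clGetDeviceInfo( device, CL_DEVICE_LOCAL_MEM_SIZE,
                     sizeof( localMem ),  &localMem,  NULL );

    printf( "Cache line size     : %u\n",   cacheLine );
    printf( "Cache size          : %llu\n", (unsigned long long) cacheSize );
    printf( "Global memory size  : %llu\n", (unsigned long long) globalMem );
    printf( "Constant buffer size: %llu\n", (unsigned long long) constBuf );
    printf( "Max constant args   : %u\n",   constArgs );
    printf( "Local memory size   : %llu\n", (unsigned long long) localMem );
}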

– user3666197