I'm trying to automatically decide how to distribute a workload between the CPU and the GPU.
What I wanted to do is go over all devices and estimate the total theoretical computing power in GFLOPS simply by multiplying:
GFLOPS = clock_speed * number_of_cores
* Yes, I know this is extremely crude, as each operation takes a different number of clock cycles on each architecture, and there are huge inefficiencies due to cache misses etc. But it is still a rough estimate of raw computing capability.
Now I can get CL_DEVICE_MAX_CLOCK_FREQUENCY and CL_DEVICE_MAX_COMPUTE_UNITS from clGetDeviceInfo().
While the clock frequency of 1777 MHz seems reasonable, the 28 compute units seem way too low for my NVIDIA GeForce RTX 3060, which has Core Config: 3584 112:48:28:112 according to Wikipedia.
Looking in the documentation:
CL_DEVICE_MAX_COMPUTE_UNITS
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.
it seems that
GFLOPS = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_GROUP_SIZE
but this does not fit the Intel CPU, where I have CL_DEVICE_MAX_WORK_GROUP_SIZE = 4096, which would make it more powerful than the GPU.
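To make the mismatch concrete, here is a minimal sketch in C that applies this formula to the values clGetDeviceInfo() reports for the two devices. The function name `naive_estimate` is hypothetical, and the division by 1000 only converts MHz into a nominal "GFLOPS" figure; it is not a real performance number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the formula tried above,
 * clock [MHz] * compute units * max work-group size,
 * divided by 1000 to get a nominal "GFLOPS" value. */
double naive_estimate(uint32_t clock_mhz,
                      uint32_t compute_units,
                      uint64_t max_work_group_size)
{
    /* The documentation states the minimum value of
     * CL_DEVICE_MAX_COMPUTE_UNITS is 1. */
    assert(compute_units >= 1);
    return (double)clock_mhz * (double)compute_units
         * (double)max_work_group_size / 1000.0;
}
```

Calling `naive_estimate(1777, 28, 1024)` for the RTX 3060 gives about 50950, while `naive_estimate(4300, 12, 4096)` for the i5-10400F gives about 211354, so the CPU would come out roughly four times ahead of the GPU. This suggests that CL_DEVICE_MAX_WORK_GROUP_SIZE is a limit on how many work-items fit in one work-group, not a count of arithmetic units.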
I tried to print all the device parameters provided by clGetDeviceInfo(), but none of them seems to give me the required information:
OpenCL platform count 2
DEVICE[0,0]: NVIDIA GeForce RTX 3060
VENDOR: NVIDIA Corporation
DEVICE_VERSION: OpenCL 3.0 CUDA
DRIVER_VERSION: 515.86.01
C_VERSION: OpenCL C 1.2
MAX_COMPUTE_UNITS: 28
MAX_CLOCK_FREQUENCY: 1777 MHz
GLOBAL_MEM_SIZE: 12019 MB
LOCAL_MEM_SIZE: 48 kB
CONSTANT_BUFFER_SIZE: 64 kB
GLOBAL_MEM_CACHE_SIZE: 784 kB
GLOBAL_MEM_CACHELINE_SIZE: 128
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 1024
MAX_WORK_ITEM_SIZES: [1024,1024,1024]
MIN_DATA_TYPE_ALIGN_SIZE: 128
DEVICE[1,0]: pthread-Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
VENDOR: GenuineIntel
DEVICE_VERSION: OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-skylake
DRIVER_VERSION: 1.8
C_VERSION: OpenCL C 1.2 pocl
MAX_COMPUTE_UNITS: 12
MAX_CLOCK_FREQUENCY: 4300 MHz
GLOBAL_MEM_SIZE: 13796 MB
LOCAL_MEM_SIZE: 256 kB
CONSTANT_BUFFER_SIZE: 256 kB
GLOBAL_MEM_CACHE_SIZE: 12288 kB
GLOBAL_MEM_CACHELINE_SIZE: 64
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 4096
MAX_WORK_ITEM_SIZES: [4096,4096,4096]
MIN_DATA_TYPE_ALIGN_SIZE: 128
related: OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS
EDIT:
By reading some materials (Wikipedia, NVIDIA documentation) I found that each compute unit has 128 shader cores (or CUDA cores). But this is vendor-specific information that I, as a user, have to look up in external sources. I want the program to estimate the computing power automatically, using only the information provided by clGetDeviceInfo().
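With that figure of 128 cores per compute unit, the numbers do line up with the core count from Wikipedia. A rough sketch in C under that assumption: `cores_per_cu` is a value supplied by hand (it is not among the parameters printed above), both function names are hypothetical, and the estimate assumes one operation per core per clock cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Total shader cores = compute units * cores per compute unit.
 * cores_per_cu is a vendor-specific value taken from external
 * sources, not from clGetDeviceInfo(). */
uint32_t total_cores(uint32_t compute_units, uint32_t cores_per_cu)
{
    assert(cores_per_cu > 0);
    return compute_units * cores_per_cu;
}

/* Crude estimate from the question: clock_speed * number_of_cores.
 * MHz * cores / 1000 gives a nominal "GFLOPS" value, assuming
 * one operation per core per clock cycle. */
double crude_gflops(uint32_t clock_mhz, uint32_t cores)
{
    return (double)clock_mhz * (double)cores / 1000.0;
}
```

For the RTX 3060, `total_cores(28, 128)` is 3584, which matches the Core Config, and `crude_gflops(1777, 3584)` is about 6369.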