
I'm trying to make an automatic decision about how to distribute a workload between CPU and GPU.

What I wanted to do is go over all devices and evaluate the total theoretical computing power in GFLOPS simply by multiplying:

GFLOPS = clock_speed * number_of_cores

* Yes, I know this is extremely crude, as each operation takes a different number of clock cycles on each architecture, and there are huge inefficiencies due to cache misses etc. But it is still a rough estimate of raw computing capability.

Now, I can get CL_DEVICE_MAX_CLOCK_FREQUENCY and CL_DEVICE_MAX_COMPUTE_UNITS from clGetDeviceInfo().

While the clock frequency of 1777 MHz seems reasonable, the 28 compute units seem way too low for my NVIDIA GeForce RTX 3060, which has core config 3584 112:48:28:112 according to Wikipedia.

Looking in the documentation:

CL_DEVICE_MAX_COMPUTE_UNITS
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.

it seems that

GFLOPS = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_GROUP_SIZE

but this does not fit for the Intel CPU, where I have CL_DEVICE_MAX_WORK_GROUP_SIZE = 4096, which would make it more powerful than the GPU.

I tried to print all device parameters provided by clGetDeviceInfo(), but none seems to give me the required information:

OpenCL platform count 2 
DEVICE[0,0]: NVIDIA GeForce RTX 3060
        VENDOR:         NVIDIA Corporation
        DEVICE_VERSION: OpenCL 3.0 CUDA
        DRIVER_VERSION: 515.86.01
        C_VERSION:      OpenCL C 1.2 
        MAX_COMPUTE_UNITS:    28
        MAX_CLOCK_FREQUENCY:  1777  MHz 
        GLOBAL_MEM_SIZE:      12019 MB  
        LOCAL_MEM_SIZE:       48 kB  
        CONSTANT_BUFFER_SIZE: 64 kB  
        GLOBAL_MEM_CACHE_SIZE:     784 kB 
        GLOBAL_MEM_CACHELINE_SIZE: 128 
        MAX_WORK_ITEM_DIMENSIONS: 3  
        MAX_WORK_GROUP_SIZE:      1024 
        MAX_WORK_ITEM_SIZES:      [1024,1024,1024] 
        MIN_DATA_TYPE_ALIGN_SIZE: 128    
        
DEVICE[1,0]: pthread-Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
        VENDOR:         GenuineIntel
        DEVICE_VERSION: OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-skylake
        DRIVER_VERSION: 1.8
        C_VERSION:      OpenCL C 1.2 pocl 
        MAX_COMPUTE_UNITS:    12
        MAX_CLOCK_FREQUENCY:  4300  MHz 
        GLOBAL_MEM_SIZE:      13796 MB  
        LOCAL_MEM_SIZE:       256 kB  
        CONSTANT_BUFFER_SIZE: 256 kB  
        GLOBAL_MEM_CACHE_SIZE:     12288 kB 
        GLOBAL_MEM_CACHELINE_SIZE: 64 
        MAX_WORK_ITEM_DIMENSIONS: 3  
        MAX_WORK_GROUP_SIZE:      4096 
        MAX_WORK_ITEM_SIZES:      [4096,4096,4096] 
        MIN_DATA_TYPE_ALIGN_SIZE: 128   

related: OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

EDIT:

By reading some materials (Wikipedia, Nvidia documentation) I found that each compute unit has 128 shader cores (or CUDA cores). But this is vendor-specific information which I, as a user, have to look up from external sources. I want the program to estimate the computing power automatically, just from the information provided by clGetDeviceInfo().

Prokop Hapala
  • Does this answer your question? [What is the relationship between NVIDIA GPUs' CUDA cores and OpenCL computing units?](https://stackoverflow.com/questions/34259338/what-is-the-relationship-between-nvidia-gpus-cuda-cores-and-opencl-computing-un) – Joachim Sauer Mar 27 '23 at 07:38
  • Also [this one](https://stackoverflow.com/questions/31009499/confusion-over-compute-units-and-expected-cores-on-nvidia-gpu) and [this one](https://stackoverflow.com/questions/34259338/what-is-the-relationship-between-nvidia-gpus-cuda-cores-and-opencl-computing-un). Basically the NVidia name for CL compute units is Streaming Multiprocessors (SMs) and the RTX 3060 is specced to have 28 of those. – Joachim Sauer Mar 27 '23 at 07:41
  • Yes, I found this on the wiki page (each compute unit is 128 shader cores). But the question is how I can do this automatically, without any external knowledge (e.g. going to Wikipedia), just from the output of `clGetDeviceInfo()` – Prokop Hapala Mar 27 '23 at 07:58
  • Do you need "computing power" in GFLOPS specifically? Since you only get a rough estimate anyway, why don't you just "accept" each vendors specification of compute unit, multiply that with frequency and use the result as an (effectively unit-less) rough estimate of performance? – Joachim Sauer Mar 27 '23 at 08:21
  • My old Tesla K40 has 192 shaders per compute unit. AMD cards of the same era have 64. They're not all 128. – Simon Goater Mar 27 '23 at 10:22
  • @JoachimSauer Because to compare different hardware I need to know how many floating-point operations each *compute unit* can do per clock cycle. For example, an Intel CPU with [AVX-512](https://en.wikipedia.org/wiki/AVX-512) can do 16 FLOPs per cycle per compute unit, while an Nvidia RTX 3060 can do 256 FLOPs per cycle per compute unit. Multiplying 12 compute units * 4.3 GHz = 51.6 GFLOPs for my CPU (the true value is 16x higher, 825.6 GFLOPs); for my GPU, 28 compute units * 1.777 GHz = 50 GFLOPs (the true value is 256x higher, 12.74 TFLOPs) – Prokop Hapala Mar 27 '23 at 13:18

1 Answer


There are two specs that mainly determine the performance of a GPU:

  • FP32 TFlops (in some cases also FP64/FP16 TFlops)
  • memory bandwidth (in some cases also cache bandwidth)

OpenCL clGetDeviceInfo provides neither of these. But: you get the number of CUs (CL_DEVICE_MAX_COMPUTE_UNITS) and the peak core clock speed according to the data sheet (CL_DEVICE_MAX_CLOCK_FREQUENCY), and together with an assumed instructions-per-cycle (IPC) value, these can give you at least a good estimate for FP32 TFlops:

int compute_units = (int)cl_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
int clock_frequency = (int)cl_device.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
int ipc = cl_device.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU ? 2 : 32; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs

// int cores_per_cu = ?; // unknown: depends on the microarchitecture, see below

int cores = compute_units*cores_per_cu;
float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency;

There is one remaining unknown: the number of cores per CU. This depends on the microarchitecture and can be:

  • 1/2 (CPUs with hyperthreading)
  • 1 (CPUs without HT)
  • 8 (Intel iGPUs/dGPUs, ARM GPUs)
  • 64 (Nvidia P100/Volta/Turing/A100/A30, AMD GCN/CDNA)
  • 128 (Nvidia Maxwell/Pascal/Ampere/Hopper/Ada, AMD RDNA/RDNA2)
  • 192 (Nvidia Kepler)
  • 256 (AMD RDNA3)

You can figure this out with some checks for the device vendor+name, which will reveal the microarchitecture:

string name = trim(cl_device.getInfo<CL_DEVICE_NAME>()); // device name
string vendor = trim(cl_device.getInfo<CL_DEVICE_VENDOR>()); // device vendor
bool is_gpu = cl_device.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU; // needed below to separate CPU and GPU cores/CU

bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {"gt 6", "gt 7", "gtx 6", "gtx 7", "quadro k", "tesla k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "quadro t", "tesla t", "quadro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported

float cores_per_cu_nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_64_cores_per_cu?64.0f:nvidia_192_cores_per_cu?192.0f:128.0f); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere, Hopper, Ada) or 64 cores/CU (P100, Volta, Turing, A100, A30)
float cores_per_cu_amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_256_cores_per_dualcu?256.0f:amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2) or 256 cores/dualCU (RDNA3), AMD CPUs (with SMT) have 1/2 core/CU
float cores_per_cu_intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
float cores_per_cu_apple = (float)(contains(to_lower(vendor), "apple"))*(128.0f); // Apple ARM GPUs usually have 128 cores/CU
float cores_per_cu_arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU

float cores_per_cu = cores_per_cu_nvidia+cores_per_cu_amd+cores_per_cu_intel+cores_per_cu_apple+cores_per_cu_arm; // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)

Find the full source code here. This estimate is correct for the vast majority of CPUs and GPUs. However there are some notable exceptions:

  • CPUs without hyperthreading will be detected with only half of their cores.
  • The CPU ipc=32 is only correct for CPUs supporting AVX2, which is the vast majority of modern CPUs. Very old CPUs may only support AVX and have ipc=16, and some HEDT CPUs support AVX512 and have ipc=64.
  • Some GPUs have the same name, but can be 2 different microarchitectures, for example the GTX 860M (Kepler or Maxwell). These cannot be easily distinguished and a more advanced lookup table would be necessary.

There is unfortunately no way to figure out the memory bandwidth from the OpenCL API. For this, you'd either have to build an extended lookup table containing hundreds of GPUs, or do a quick benchmark run.

ProjectPhysX
  • Thank you for the comprehensive answer. It is a shame that such crucial information as `cores_per_cu` cannot be obtained directly from the OpenCL API, but your analysis is very helpful! – Prokop Hapala Mar 28 '23 at 03:29