
I have an RX 570. This is the information I received from clGetDeviceInfo:

MaxComputeUnitPerGPU: 32

MaxWorkGroupSize: 256

MaxWorkItemSize: 256

MaxGlobalMemoryOfDevice: 4294967296

MaxPrivateMemoryBytesPerWorkGroup: 16384

MaxLocalMemoryBytesPerWorkGroup: 32768

If I have 256 work groups and 256 work items per work group, it would mean that I have:

64 bytes of private (L1?) memory per work item (16384/256)
32768 bytes of local (L2) memory per work group

And if I use 17 floats, would it overflow to L2?

or

If I use 15 floats and 2 private floats, would it overflow to L2?

Also, is float the same as private float? Answer: they are the same by default, per @doqtor.

or

If I use 16 floats and call functions like pow, sqrt, and clamp, would a register (L1?) overflow occur?

Punal Manalan

2 Answers


Variables without an address space qualifier are private by default. From the OpenCL documentation:

Variables inside a __kernel function not declared with an address space qualifier, all variables inside non-kernel functions, and all function arguments are in the __private or private address space. Variables declared as pointers are considered to point to the __private address space if an address space qualifier is not specified.

Private variables are stored in registers on the GPU. If a kernel uses more registers than are available, some variables are stored in global memory instead (register spilling).

doqtor

To add to doqtor's answer: if your kernel is bandwidth-bound, you can detect register spilling with a roofline analysis. Count the number of FLOPs and memory transfers from the program binaries (`string binaries = program.getInfo<CL_PROGRAM_BINARIES>()[0];`). If the measured performance is very close to the bandwidth limit, there is no spilling. If you then increase the number of private variables, for example with a matrix multiplication in private memory, and performance drops significantly, you have a register spill: private variables are suddenly read from global memory, and since you were already at the bandwidth limit, the additional global memory accesses slow the kernel down.
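The arithmetic behind that check can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: the function names are hypothetical, and the FLOP count, byte count, run time, and peak figures are values you would have to count or measure yourself for your own kernel and device.

```c
#include <assert.h>

/* Arithmetic intensity: FLOPs executed per byte moved to/from global memory. */
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Roofline ceiling in GFLOP/s: the lower of the compute roof and the
   bandwidth roof (intensity * peak bandwidth in GB/s). */
double roofline_gflops(double intensity, double peak_gflops, double peak_bw_gbs) {
    double bandwidth_roof = intensity * peak_bw_gbs;
    return bandwidth_roof < peak_gflops ? bandwidth_roof : peak_gflops;
}

/* Fraction of peak bandwidth achieved, based on the bytes the kernel is
   EXPECTED to move (counted from the code) and the measured run time. */
double bandwidth_fraction(double expected_bytes, double seconds, double peak_bw_gbs) {
    assert(seconds > 0.0 && peak_bw_gbs > 0.0);
    return (expected_bytes / seconds / 1e9) / peak_bw_gbs;
}
```

With placeholder values of 2e9 FLOPs, 8e9 bytes, a 0.04 s run time, and a 224 GB/s peak, the intensity is 0.25 FLOP/byte and `bandwidth_fraction` comes out at about 0.89. If adding private variables leaves the expected byte count unchanged but this fraction falls sharply, the extra traffic is likely spilled registers.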

ProjectPhysX
  • Once again, thanks for helping me with this subject too! Can you please tell me more about *getInfo* in terms of *clGetProgramInfo*? I don't know how to get the "binary" when I use *clGetProgramInfo*, and I don't have code for *getInfo*. – Punal Manalan Apr 25 '21 at 10:50
  • See here: https://stackoverflow.com/a/7338930/9178992 Alternatively, you can also count FLOPs and memory accesses directly in the OpenCL C code. The FLOP count doesn't have to be exact; the order of magnitude of the arithmetic intensity is enough. – ProjectPhysX Apr 25 '21 at 11:03
  • The code in the link has mismatched parameters, and I do not know how to match them correctly. This line, for example: `errcode = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, binary_sizes, number_of_binaries * sizeof(int), &number_of_binaries);` – Punal Manalan Apr 25 '21 at 11:16