OpenCL: How efficient is the use of user-defined functions in kernel code in terms of memory and performance?

Question

In OpenCL C kernel code, the built-in functions are good, but what about user-defined functions? Do they cost performance or memory compared with the built-in ones? If so, is it better to write the body of such a function directly inside the __kernel void function, whether it is needed once or multiple times?

For example:

gentype clamp ( gentype x,
gentype minval,
gentype maxval)

The above is a built-in function that has no impact on performance, nor does it reduce GPU L0/L1 cache memory.

By user-defined function I mean something like this:

int Add(int a, int b)
{
   return a + b;
}

Do these functions have any impact on L0/L1 memory? If so, is it better not to write them as functions and instead repeat the code everywhere?

Can you clarify your question with a well-defined example? To me the current question makes no sense. What do you mean by "user-defined functions"? – Jérôme Richard, Jun 19 '21 at 16:58
@JérômeRichard Thanks for commenting! Let me update my question. – Punal Manalan, Jun 19 '21 at 17:00
`Add` will likely be inlined at compile time (AFAIK this is done on Nvidia platforms), so there is likely no overhead in practice. You can give the compiler a hint using the `inline` keyword or an attribute as in the proposed answer. A built-in is not necessarily a low-level instruction, so "built-ins" should be about as fast as "user-defined functions". However, keep in mind that the implementations of built-ins are generally smart and carefully optimized for the target platform. – Jérôme Richard, Jun 19 '21 at 17:46
@JérômeRichard Thanks for the info! I will inline my functions then! – Punal Manalan, Jun 19 '21 at 18:04

ProjectPhysX · Accepted Answer · 2021-07-06T22:45:53.887

3

I usually inline all functions, except if they are very lengthy and are called many times within a kernel. For example

float __attribute__((always_inline)) sq(const float x) {
    return x*x;
}

for computing the square of x. An inlined function adds no computational cost for the function call itself. However, if you inline a very long function many times in a kernel, the assembly blows up and spills into global memory, resulting in a loss of performance. For such long functions it is better not to inline, because the overhead of the function call is negligible compared to the execution time of the function itself. Finally, if you don't explicitly inline a very short function, the compiler will do it automatically in most cases. The same holds for loop unrolling with #pragma unroll.
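As a rough illustration of the point about short helpers, here is a minimal sketch in plain C (used as a host-compilable stand-in for OpenCL C, assuming GCC or Clang for the attribute syntax). The names sq_plain and sum_of_squares are hypothetical and not part of the answer; the sketch only shows that the forced-inline helper and the ordinary helper compute the same thing, so the choice affects code size and call overhead, not results.

```c
/* Plain C stand-in for OpenCL C helper functions (GCC/Clang attribute syntax). */

/* Forced inline, as in the answer: the body is pasted at every call site. */
static inline __attribute__((always_inline)) float sq(const float x) {
    return x * x;
}

/* Hypothetical helper left to the compiler's own inlining heuristics. */
static float sq_plain(const float x) {
    return x * x;
}

/* Hypothetical caller: both variants produce the same value. */
static float sum_of_squares(const float a, const float b) {
    return sq(a) + sq_plain(b);
}
```

For a function this short, an optimizing compiler will usually inline sq_plain as well, which is why the attribute mostly matters for borderline cases.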

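Regarding the math functions, most of them map directly to hardware, with a few exceptions. For example, the count-leading-zeroes function int y = clz(x);, despite being translated into the clz PTX instruction, has no dedicated hardware and is slower than emulating it with int y = 158-(int)(as_uint((float)x)>>23);, where 158 is 31 plus the float exponent bias of 127. This emulation is exact only for 0 < x < 2^24: larger values can round up to the next power of two during the float conversion, and x = 0 needs special handling. Similarly, although the inverse square root rsqrt(x) is executed in hardware,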

float __attribute__((always_inline)) fast_rsqrt(const float x) {
    return as_float(0x5F37642F-(as_int(x)>>1));
}

runs slightly faster but is less accurate. In most cases the built-in math functions are the best option though.
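The two bit tricks above can be checked on the host with a short sketch in plain C. The helpers as_uint_ and as_float_ are hypothetical stand-ins for OpenCL's as_uint and as_float, written with memcpy, and clz_emulated is a name chosen here for illustration. The sketch assumes x > 0 for fast_rsqrt (so the unsigned shift matches the signed one in the answer) and 0 < x < 2^24 for clz_emulated.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for OpenCL's as_uint() / as_float() bit casts. */
static uint32_t as_uint_(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float as_float_(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* The float exponent field holds 127 + floor(log2(x)), and
 * clz(x) = 31 - floor(log2(x)), hence 158 - exponent.
 * Assumes 0 < x < 2^24 so the int-to-float conversion is exact. */
static int clz_emulated(uint32_t x) {
    return 158 - (int)(as_uint_((float)x) >> 23);
}

/* Approximate 1/sqrt(x) for x > 0, without a Newton refinement step.
 * The result is only accurate to a few percent. */
static float fast_rsqrt(float x) {
    return as_float_(0x5F37642Fu - (as_uint_(x) >> 1));
}
```

Without a refinement step the approximation is off by a few percent (for x = 4 it returns roughly 0.483 instead of 0.5), which is the accuracy trade-off mentioned above.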

edited Jul 06 '21 at 22:45

answered Jun 19 '21 at 17:08

ProjectPhysX

Thanks once again for answering so many of my questions! Also, what does "function many times in a kernel" mean? Does it mean calling the function many times in the code? – Punal Manalan Jun 19 '21 at 17:25
Also, since inlining puts a copy of the function wherever it is called, would it not be better to use macros? – Punal Manalan Jun 19 '21 at 17:28
By "function many times in a kernel" I mean: when the function is about 100 lines or longer and is called about 10 times or more in the kernel, I wouldn't inline it. Inlining itself works just like a macro, copying the function body between the {} brackets right to where the function is called. But inlining is safer than macros; excessive macro usage can do nasty things. – ProjectPhysX Jun 19 '21 at 19:46
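One concrete example of the "nasty things" a macro can do is evaluating its argument more than once. The following is a minimal sketch in plain C (as a stand-in for OpenCL C); SQ_MACRO, sq_inline, next_value and the two calls_with_* helpers are hypothetical names introduced only for this illustration.

```c
/* Hypothetical counter used only to make side effects visible. */
static int calls = 0;

static float next_value(void) {
    calls++;
    return 3.0f;
}

/* Macro version: the argument text is pasted twice. */
#define SQ_MACRO(x) ((x) * (x))

/* Inline function version: the argument is evaluated exactly once. */
static inline float sq_inline(const float x) {
    return x * x;
}

/* Returns how many times next_value() ran when squared via the macro. */
static int calls_with_macro(void) {
    calls = 0;
    (void)SQ_MACRO(next_value());
    return calls;
}

/* Returns how many times next_value() ran when squared via the function. */
static int calls_with_inline(void) {
    calls = 0;
    (void)sq_inline(next_value());
    return calls;
}
```

Here calls_with_macro() reports 2 evaluations while calls_with_inline() reports 1, so an inline function keeps ordinary function-call semantics while still avoiding call overhead.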
Once again, thank you very much! But if my kernel somehow reached 100% register usage, would the inlined code spill to global memory? – Punal Manalan Jun 20 '21 at 04:19
If you use too many private variables *at once*, for example by keeping a 30x30 float matrix in private memory space, registers might spill to global memory at a loss of performance. If the kernel is merely lengthy, the same registers are reused over and over and there is no spilling. Register spilling can happen regardless of whether you inline or not. – ProjectPhysX Jun 20 '21 at 05:18
Thanks a lot! All my questions on this matter have now been cleared up! – Punal Manalan Jun 20 '21 at 05:29
