17

There is a lot of advice on when to use inline functions and when to avoid them in regular C coding. What is the effect of __forceinline__ on CUDA C __device__ functions? Where should it be used, and where should it be avoided?

talonmies
Farzad

1 Answer

11

Normally the nvcc device code compiler will make its own decisions about when to inline a particular __device__ function, and generally speaking you probably don't need to worry about overriding that with the __forceinline__ decorator/directive.

Compute capability (cc) 1.x devices don't have all the same hardware capabilities as newer devices, so very often the compiler will automatically inline functions for those devices.

I think the reason to specify __forceinline__ is the same as what you may have learned about host C code. It is usually used as an optimization when the compiler might not otherwise inline the function (e.g. on cc 2.x or newer devices). The saving (i.e. the avoided function call overhead) might be negligible if you call the function once, but if you call it in a loop, for example, making sure it is inlined might give a noticeable improvement in execution time.
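To illustrate the loop case, here is a minimal sketch. The helper `horner_step`, the kernel `horner_kernel`, the host wrapper `run_horner`, and the polynomial coefficients are all made up for this example, and error checking is left out for brevity. Whether forcing the inline actually helps depends on the compiler version and target architecture, so treat it as something to measure rather than a guarantee.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: __forceinline__ asks nvcc to inline it at each call site
// instead of leaving the decision to the compiler's heuristics.
__device__ __forceinline__ float horner_step(float acc, float x, float c)
{
    return acc * x + c;  // one step of Horner's rule
}

// Hypothetical kernel: evaluates 2x^3 - x^2 + 0.5x + 3 for each element,
// calling the small helper inside a loop.
__global__ void horner_kernel(const float* x, float* y, int n)
{
    const float coeffs[4] = {2.0f, -1.0f, 0.5f, 3.0f};
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < 4; ++k)
            acc = horner_step(acc, x[i], coeffs[k]);
        y[i] = acc;
    }
}

// Host wrapper (assumes a non-empty input; CUDA error checking omitted).
std::vector<float> run_horner(const std::vector<float>& h_x)
{
    int n = static_cast<int>(h_x.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> h_y(n);

    float* d_x = nullptr;
    float* d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    horner_kernel<<<blocks, threads>>>(d_x, d_y, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}

int main()
{
    std::vector<float> y = run_horner({0.0f, 1.0f, 2.0f});
    for (float v : y)
        printf("%f\n", v);
    return 0;
}
```

To see what the qualifier changes, one option is to build with and without it and compare the generated code (for example with `cuobjdump -sass`) or the timings. For a helper this small, nvcc will very likely inline it on its own.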

As a counterexample, inlining and recursion generally don't mix. For a recursive function that calls itself, I don't think it's possible to handle arbitrary recursion and also strict inlining. So if you intend to use a function recursively (supported in cc 2.x and above), you probably wouldn't want to specify __forceinline__.
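A minimal sketch of the recursive case, assuming a cc 2.x or newer device. The names `factorial`, `factorial_kernel`, and `run_factorial` are hypothetical, and error checking is omitted. The recursive function is deliberately left without __forceinline__, so the compiler can emit a real call.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical recursive device function: no __forceinline__ here, because a
// call chain whose depth depends on run-time data cannot be fully inlined.
__device__ unsigned int factorial(unsigned int n)
{
    return (n <= 1u) ? 1u : n * factorial(n - 1u);
}

// Each thread computes the factorial of its own index.
__global__ void factorial_kernel(unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factorial(static_cast<unsigned int>(i));
}

// Host wrapper (assumes 0 < n <= 13 so results fit in 32 bits;
// CUDA error checking omitted).
std::vector<unsigned int> run_factorial(int n)
{
    size_t bytes = n * sizeof(unsigned int);
    std::vector<unsigned int> h_out(n);

    unsigned int* d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    int threads = 32;
    int blocks = (n + threads - 1) / threads;
    factorial_kernel<<<blocks, threads>>>(d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<unsigned int> f = run_factorial(6);
    for (unsigned int v : f)
        printf("%u\n", v);
    return 0;
}
```

The recursion depth here is tiny, so the default per-thread stack should be enough. Deeper recursion may need a larger stack, which can be requested with `cudaDeviceSetLimit(cudaLimitStackSize, ...)`.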

In general, I think you should let the compiler manage this for you. It will intelligently decide whether to inline a function.

Robert Crovella
  • 3
    In expression templates, you want to achieve the same performance as hand-written code by using a simpler and natural mathematical syntax. For this case, I needed to use `__forceinline__` (and the host `__forceinline` counterpart) to guarantee that. – Vitality Nov 11 '13 at 06:32