To avoid really long and incoherent functions, I am calling a number of __device__ functions from a kernel. I allocate a shared memory buffer at the beginning of the kernel (it is allocated per thread block) and pass pointers to it to all the __device__ functions that perform the individual processing steps of the kernel.
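For concreteness, the structure looks roughly like this (the names are placeholders, and I assume the kernel is launched with 256 threads per block on a data size that is a multiple of 256):

__device__ void processStep(float *buf)
{
    // Just a plain pointer: nothing in the signature says whether
    // buf points into shared or global memory.
    buf[threadIdx.x] *= 2.0f;
}

__global__ void myKernel(const float *in, float *out)
{
    // Scratch buffer, allocated once per thread block.
    __shared__ float sharedBuf[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sharedBuf[threadIdx.x] = in[i];
    __syncthreads();

    processStep(sharedBuf);   // several such steps in the real code
    __syncthreads();

    out[i] = sharedBuf[threadIdx.x];
}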
I was wondering about the following: if I allocate a shared memory buffer in a __global__ function, how can the __device__ functions to which I pass a pointer distinguish between the possible address spaces (global device memory or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a __shared__ modifier, according to the CUDA programming guide. The only ways it could be implemented, IMHO, are:

a) putting markers on the allocated memory,

b) passing invisible parameters with the call, or

c) having a virtual unified address space with separate segments for global and shared memory, so that a threshold check on the pointer can decide (see the sketch below).
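If (c) is what actually happens, I would expect something like the following to work. This is just a sketch: I have not checked whether the address-space predicate intrinsics documented in newer toolkits (__isShared(), __isGlobal()) are actually available with my 3.0 setup.

__device__ void processStep(float *buf)
{
    // Hypothetical: ask which address space the generic pointer
    // falls into; the intrinsics return non-zero if the pointer
    // refers to shared / global memory, respectively.
    if (__isShared(buf)) {
        // take a shared-memory-specific path
    } else if (__isGlobal(buf)) {
        // take a global-memory-specific path
    }
    buf[threadIdx.x] *= 2.0f;
}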
So my question is: do I need to worry about this at all, or how should one proceed otherwise without inlining all functions into the main kernel?
As an aside, I was horrified to find today that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. In effect this means I have to declare all __device__ functions inline, and the separation of header / source files is broken; I end up with something like the sketch below. This is of course quite ugly, but is there an alternative?
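For illustration, the workaround looks roughly like this (file names made up):

// processing_steps.cuh -- the definition has to live in a header
// (and be marked inline) rather than in its own .cu file, because
// the kernel's translation unit must see the whole body.
inline __device__ void processStep(float *buf)
{
    buf[threadIdx.x] *= 2.0f;
}

// kernels.cu
#include "processing_steps.cuh"   // pulled in wholesale, so no real separation of interface and implementation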