I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation.
So there's a kernel
__global__ void complexStuff(...)
which calls this interpolation device function one or more times:
__device__ void interpolate(...)
The interpolation algorithm performs a WENO interpolation successively over three dimensions. This is a highly parallelizable task which I would urgently like to parallelize!
It is clear that the kernel complexStuff()
can easily be parallelized by calling it from host code using the <<<...>>>
syntax. Note that complexStuff()
is already parallelized in this way.
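
To make the setup concrete, here is a minimal sketch of the structure described above. It rests on several assumptions: the parameter lists, the host helper runComplexStuff(), and the launch configuration are hypothetical, and the body of interpolate() is plain linear interpolation standing in for the real WENO routine. It only illustrates that each thread of complexStuff() currently runs interpolate() serially.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the real WENO routine (assumption): linear interpolation
// between two neighbouring samples. It runs serially in the calling thread.
__device__ void interpolate(const float* in, int i, float t, float* out)
{
    *out = (1.0f - t) * in[i] + t * in[i + 1];
}

// One thread per output element; every thread calls interpolate() itself.
// The parameter list is hypothetical.
__global__ void complexStuff(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) {
        interpolate(in, i, 0.5f, &out[i]);
    }
}

// Hypothetical host helper: the <<<...>>> launch here is the only place
// where threads are created. Returns true if everything succeeded.
bool runComplexStuff(const std::vector<float>& in, std::vector<float>& out)
{
    const int n = static_cast<int>(in.size());
    if (n < 2) return false;
    out.assign(n - 1, 0.0f);

    float* dIn = nullptr;
    float* dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, (n - 1) * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaError_t err = cudaMemcpy(dIn, in.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 128;
        const int blocks = (n - 1 + threads - 1) / threads;
        complexStuff<<<blocks, threads>>>(dIn, dOut, n);
        err = cudaGetLastError();                 // launch configuration errors
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess) {
        err = cudaMemcpy(out.data(), dOut, (n - 1) * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

In this sketch all parallelism comes from the host-side launch; the call to interpolate() inside the kernel does not create any additional threads.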
But it's not clear to me how to parallelize work, or create new threads, from inside a CUDA device function. Is this even possible? Does anyone know?