
I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it from another host thread). thrust::cuda::par.on(stream) is a host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on the Turing architecture.


#include <thrust/scatter.h>
#include <thrust/execution_policy.h>

__global__ void async_scatter_kernel(float* first,
    float* last,
    int* map,
    float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}

I know Thrust uses dynamic parallelism to launch its kernels when called from the device; however, I couldn't find a way to specify the stream.
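For reference, the host-thread alternative mentioned above would not hit this problem, since thrust::cuda::par.on(stream) is usable from host code. A minimal sketch of that approach, assuming the pointers are device pointers allocated elsewhere (the helper name scatter_async is chosen for illustration):

```cuda
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>

// Hypothetical host-side helper: enqueues thrust::scatter on a
// non-blocking stream so it can overlap with kernels on other streams.
void scatter_async(float* first, float* last, int* map, float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    // The scatter is issued on this stream; depending on the Thrust
    // version, the call may still synchronize before returning.
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaStreamSynchronize(stream);  // wait only for this stream's work
    cudaStreamDestroy(stream);
}
```

Calling this from a separate host thread (e.g. via std::thread) keeps the main thread free while the scatter runs.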

heapoverflow
    It will be asynchronous from the calling thread even if you don't specify a stream and just use `thrust::device` execution policy (and provide the necessary compilation environment and run environment for dynamic parallelism) – Robert Crovella Sep 24 '19 at 13:53
  • @RobertCrovella I also want my kernel to run concurrently with other kernels since I have/require multi-level concurrency. As far as I know, `thrust::device` execution policy runs on the null stream. I couldn't profile the kernel to see the behavior, since visual profiler doesn't support dynamic parallelism for CC 7.0 or later. I think, to achieve concurrency with other kernels I need to be able to launch it in a stream other than the null stream. – heapoverflow Sep 25 '19 at 08:38
  • Your code compiles cleanly for me on CUDA 10.1.243. My guess would be that your compilation command line (which you haven't shown) is not correctly specifying the necessary environment for CUDA Dynamic Parallelism compilation. – Robert Crovella Sep 25 '19 at 11:58
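Following the first comment above, a sketch of the kernel using the thrust::device execution policy, which requires no stream object in device code (this still needs -rdc=true and a dynamic-parallelism-capable architecture, and whether it launches a child kernel or falls back to sequential execution depends on the Thrust version):

```cuda
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>

// Variant of the kernel from the question: with thrust::device,
// no stream is created in device code, and the scatter is still
// asynchronous from the perspective of the launching host thread.
__global__ void async_scatter_kernel(float* first,
    float* last,
    int* map,
    float* output)
{
    thrust::scatter(thrust::device, first, last, map, output);
}
```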

1 Answer


The following code compiles cleanly for me on CUDA 10.1.243:

$ cat t1518.cu
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>

__global__ void async_scatter_kernel(float* first,
    float* last,
    int* map,
    float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}

int main(){

  float *first = NULL;
  float *last = NULL;
  float *output = NULL;
  int *map = NULL;
  async_scatter_kernel<<<1,1>>>(first, last, map, output);
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -rdc=true t1518.cu -o t1518
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$

The -arch=sm_35 (or similar) and -rdc=true are necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the -rdc=true switch, you get an error similar to what you describe:

$ nvcc -arch=sm_35 t1518.cu -o t1518
t1518.cu(11): error: calling a __host__ function("thrust::cuda_cub::par_t::on const") from a __global__ function("async_scatter_kernel") is not allowed

t1518.cu(11): error: identifier "thrust::cuda_cub::par_t::on const" is undefined in device code

2 errors detected in the compilation of "/tmp/tmpxft_00003a80_00000000-8_t1518.cpp1.ii".
$

So, for the example you have shown here, your compilation error can be eliminated by updating to the latest CUDA version, by specifying the proper compile command line, or both.

Robert Crovella
  • The problem was indeed `-rdc=true`. But, I can't profile the application when there is dynamic parallelism, neither with Visual Profiler nor with Nsight Systems. I can only see memory copies. Is there a way to trace the kernels with dynamic parallelism? – heapoverflow Sep 27 '19 at 14:27