
I am trying to write a program that runs almost entirely on the GPU, with very little interaction with the host. initKernel is the first kernel launched from the host. I use dynamic parallelism to launch successive kernels from initKernel, two of which are thrust::sort(thrust::device, ...).
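For context, the setup described above can be sketched as follows. All names other than initKernel are hypothetical, and this assumes compilation with relocatable device code (-rdc=true) for an sm_35 or newer device:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical follow-up kernel launched from initKernel.
__global__ void nextKernel(int *data, int n)
{
    // ... further device-side work ...
}

// First kernel launched from the host; everything after this runs on the GPU.
__global__ void initKernel(int *data, int n)
{
    // Calling thrust::sort with thrust::device from device code makes
    // thrust launch its sub-kernels via dynamic parallelism, and thrust
    // allocates its temporary buffer with a device-side cudaMalloc().
    thrust::sort(thrust::device, data, data + n);

    // Successive kernel launched from the device.
    nextKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```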


Before launching initKernel, I do a cudaMalloc() in the host code, and it is shown in the Runtime API row of the Visual Profiler. None of the cudaMalloc() calls that appear in the __device__ functions and successive kernels (after the launch of initKernel) are shown in the Runtime API row of the Visual Profiler. Can someone help me understand why I cannot see these cudaMallocs in the Visual Profiler?

Thank you for your time.

progammer

1 Answer


Can someone help me understand why I cannot see the cudaMallocs in the Visual profiler?

Because it is a documented limitation of the tool. From the documentation:

The Visual Profiler timeline does not display CUDA API calls invoked from within device-launched kernels.

Robert Crovella
  • Thanks for the information. Can you suggest a tip (or technique) to identify such cudaMalloc() related hotspots? For example, when I use thrust::sort() from the device, I realize that on every call to that thrust::sort() kernel, cudaMalloc() is being called to allocate a large temporary buffer. Is there a way to find out the time taken by these particular cudaMallocs when thrust::sort() is called from the device code? – progammer Oct 23 '18 at 14:07
  • both thrust and cub provide for the possibility of using your own allocator. It's mandatory for cub, and optional for thrust. If you don't like the `cudaMalloc` behavior of thrust, you could investigate using thrust with a custom allocator or just using cub for device code sorting. I don't know of a way to make device API cudaMalloc appear in the visual profiler timeline, just like I don't know of a way to make any other device level instruction appear. You can use [PC sampling](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#pc-sampling) to get an idea, however. – Robert Crovella Oct 24 '18 at 14:41
  • Yes, I investigated the use of a custom allocator in thrust::sort(), a couple of days ago. The examples concerning custom allocator that I find on the internet give me several compile time errors (when integrated in my Visual Studio project). Apparently, it is not that straightforward. I cannot use cub as I need to sort an array of structs. Thanks for the link. – progammer Oct 24 '18 at 15:36
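Following up on the cub suggestion and the struct limitation raised in the last comment: cub's radix sort handles only plain key (or key/value) types, but an array of structs can often be sorted indirectly by sorting (key, index) pairs and then gathering. Because cub requires the caller to supply the temporary buffer, the cudaMalloc() can be done once up front instead of on every sort. A hedged host-side sketch, with all names (Item, sortItemsByKey, extractKeys) being illustrative rather than from the question:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Illustrative struct; cub cannot radix-sort this directly,
// so we sort a separate key array together with indices.
struct Item { float key; int payload; };

__global__ void extractKeys(const Item *items, float *keys, int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { keys[i] = items[i].key; idx[i] = i; }
}

void sortItemsByKey(const Item *d_items, int *d_order, int n)
{
    float *d_keys_in, *d_keys_out;
    int   *d_idx_in;
    cudaMalloc(&d_keys_in,  n * sizeof(float));
    cudaMalloc(&d_keys_out, n * sizeof(float));
    cudaMalloc(&d_idx_in,   n * sizeof(int));

    extractKeys<<<(n + 255) / 256, 256>>>(d_items, d_keys_in, d_idx_in, n);

    // First call with a null temp pointer only queries the required size.
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_order, n);
    cudaMalloc(&d_temp, temp_bytes);   // the one explicit allocation

    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_order, n);
    // d_order now holds the indices of d_items in ascending key order;
    // gather the structs with a separate kernel if a sorted copy is needed.

    cudaFree(d_keys_in); cudaFree(d_keys_out);
    cudaFree(d_idx_in);  cudaFree(d_temp);
}
```

The same two-call pattern applies when calling cub from device code under dynamic parallelism, which is the scenario in the question; the key point is that the temporary-storage allocation is explicit and under the caller's control.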