
I am trying to write a program that runs almost entirely on the GPU, with very little interaction with the host. initKernel is the first kernel launched from the host. I use dynamic parallelism to launch successive kernels from initKernel, two of which are thrust::sort(thrust::device, ...).
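For context, the setup described above can be sketched as follows. All names other than initKernel are hypothetical, and this assumes compilation with relocatable device code (-rdc=true) for an sm_35 or newer device:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical follow-up kernel launched from initKernel.
__global__ void nextKernel(int *data, int n)
{
    // ... further device-side work ...
}

// First kernel launched from the host; everything after this runs on the GPU.
__global__ void initKernel(int *data, int n)
{
    // Calling thrust::sort with thrust::device from device code makes
    // thrust launch its sub-kernels via dynamic parallelism, and thrust
    // allocates its temporary buffer with a device-side cudaMalloc().
    thrust::sort(thrust::device, data, data + n);

    // Successive kernel launched from the device.
    nextKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```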


Before launching initKernel, I do a cudaMalloc() in the host code, and it is shown in the Runtime API row of the Visual Profiler. None of the cudaMalloc() calls that appear in the __device__ functions and successive kernels (after the launch of initKernel) are shown in the Runtime API row of the Visual Profiler. Can someone help me understand why I cannot see these cudaMallocs in the Visual Profiler?

Thank you for your time.

progammer

1 Answer


Can someone help me understand why I cannot see the cudaMallocs in the Visual profiler?

Because it is a documented limitation of the tool. From the documentation:

The Visual Profiler timeline does not display CUDA API calls invoked from within device-launched kernels.

Robert Crovella
  • Thanks for the information. Can you suggest a tip (or technique) to identify such cudaMalloc() related hotspots? For example, when I use thrust::sort() from the device, I realize that on every call to that thrust::sort() kernel, cudaMalloc() is being called to allocate a large temporary buffer. Is there a way to find out the time taken by these particular cudaMallocs when thrust::sort() is called from the device code? – progammer Oct 23 '18 at 14:07
  • both thrust and cub provide for the possibility of using your own allocator. It's mandatory for cub, and optional for thrust. If you don't like the `cudaMalloc` behavior of thrust, you could investigate using thrust with a custom allocator or just using cub for device code sorting. I don't know of a way to make device API cudaMalloc appear in the visual profiler timeline, just like I don't know of a way to make any other device level instruction appear. You can use [PC sampling](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#pc-sampling) to get an idea, however. – Robert Crovella Oct 24 '18 at 14:41
  • Yes, I investigated the use of a custom allocator in thrust::sort(), a couple of days ago. The examples concerning custom allocator that I find on the internet give me several compile time errors (when integrated in my Visual Studio project). Apparently, it is not that straightforward. I cannot use cub as I need to sort an array of structs. Thanks for the link. – progammer Oct 24 '18 at 15:36
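Following up on the cub suggestion and the struct limitation raised in the last comment: cub's radix sort handles only plain key (or key/value) types, but an array of structs can often be sorted indirectly by sorting (key, index) pairs and then gathering. Because cub requires the caller to supply the temporary buffer, the cudaMalloc() can be done once up front instead of on every sort. A hedged host-side sketch, with all names (Item, sortItemsByKey, extractKeys) being illustrative rather than from the question:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Illustrative struct; cub cannot radix-sort this directly,
// so we sort a separate key array together with indices.
struct Item { float key; int payload; };

__global__ void extractKeys(const Item *items, float *keys, int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { keys[i] = items[i].key; idx[i] = i; }
}

void sortItemsByKey(const Item *d_items, int *d_order, int n)
{
    float *d_keys_in, *d_keys_out;
    int   *d_idx_in;
    cudaMalloc(&d_keys_in,  n * sizeof(float));
    cudaMalloc(&d_keys_out, n * sizeof(float));
    cudaMalloc(&d_idx_in,   n * sizeof(int));

    extractKeys<<<(n + 255) / 256, 256>>>(d_items, d_keys_in, d_idx_in, n);

    // First call with a null temp pointer only queries the required size.
    void  *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_order, n);
    cudaMalloc(&d_temp, temp_bytes);   // the one explicit allocation

    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_idx_in, d_order, n);
    // d_order now holds the indices of d_items in ascending key order;
    // gather the structs with a separate kernel if a sorted copy is needed.

    cudaFree(d_keys_in); cudaFree(d_keys_out);
    cudaFree(d_idx_in);  cudaFree(d_temp);
}
```

The same two-call pattern applies when calling cub from device code under dynamic parallelism, which is the scenario in the question; the key point is that the temporary-storage allocation is explicit and under the caller's control.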