There are numerous questions here on the cuda SO tag about how to use pinned memory for "zero-copy" operations. Here is one example; you can find many more.
If you only have to write to each output point once, and your writes are/would be nicely coalesced, then there should not be a major performance difference between the costs of:
- writing to device memory and then cudaMemcpy D->H after the kernel
- writing directly to host-pinned memory
In the second case you will still need a cudaDeviceSynchronize() after the kernel call, before accessing the data on the host, to ensure consistency (the cudaMemcpy in the first case synchronizes implicitly).
Differences on the order of ~10 microseconds are still possible due to CUDA operation overheads.
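For illustration, here is a minimal sketch of the two patterns described above (the kernel, names, and sizes are mine, not from the question; error checking omitted):

__global__ void produce(int *out, size_t n){
  // grid-stride loop, one coalesced write per output point
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = (int)i;
}

int main(){
  const size_t n = 1048576;
  // Option 1: write to a device allocation, then cudaMemcpy D->H
  int *d_out, *h_out = new int[n];
  cudaMalloc(&d_out, sizeof(d_out[0])*n);
  produce<<<160, 1024>>>(d_out, n);
  cudaMemcpy(h_out, d_out, sizeof(d_out[0])*n, cudaMemcpyDeviceToHost); // blocks until the copy completes
  // Option 2: write directly to host-pinned memory from the kernel
  int *h_pinned;
  cudaHostAlloc(&h_pinned, sizeof(h_pinned[0])*n, cudaHostAllocDefault);
  produce<<<160, 1024>>>(h_pinned, n);
  cudaDeviceSynchronize(); // required before the host reads h_pinned
}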
It should be possible to demonstrate that bulk transfer of data using direct reads/writes to pinned memory from kernel code will achieve approximately the same bandwidth as a cudaMemcpy transfer.
As an aside, the "paging semantics" of unified memory can be worked around (for example, by prefetching), but again, well-optimized code in any of these three scenarios is not likely to show marked performance or duration differences.
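To illustrate the prefetching workaround (a sketch of my own, not from the question; it assumes a GPU/OS combination that supports concurrentManagedAccess, e.g. Pascal or newer on Linux), the managed pages are moved to the processor that will touch them next, so neither the kernel nor the host pays demand-paging costs:

__global__ void produce(int *out, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = (int)i;
}

int main(){
  const size_t n = 1048576;
  int *m;
  cudaMallocManaged(&m, sizeof(m[0])*n);
  cudaMemPrefetchAsync(m, sizeof(m[0])*n, 0);               // move pages to GPU 0 before the kernel runs
  produce<<<160, 1024>>>(m, n);
  cudaMemPrefetchAsync(m, sizeof(m[0])*n, cudaCpuDeviceId); // move the results back to host memory
  cudaDeviceSynchronize();                                  // data is now resident on the host
}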
Responding to comments: my use of "approximately" above is probably a stretch. Here's a kernel that writes 4GB of data in less than half a second on a PCIE Gen2 system:
$ cat t2138.cu
template <typename T>
__global__ void k(T *d, size_t n){
  // grid-stride loop: write every element of the n-element output
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  int *d;
  size_t n = 1048576*1024;          // 4GB of int data
  // pinned host allocation, written directly from the kernel
  cudaHostAlloc(&d, sizeof(d[0])*n, cudaHostAllocDefault);
  k<<<160, 1024>>>(d, n);           // launched twice; both calls appear in the profile below
  k<<<160, 1024>>>(d, n);
  cudaDeviceSynchronize();
  // for comparison: a bulk cudaMemcpy DtoH from a device allocation into the same pinned buffer
  int *d1;
  cudaMalloc(&d1, sizeof(d[0])*n);
  cudaMemcpy(d, d1, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
}
$ nvcc -o t2138 t2138.cu
$ compute-sanitizer ./t2138
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2138
==21201== NVPROF is profiling process 21201, command: ./t2138
==21201== Profiling application: ./t2138
==21201== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.48%  889.00ms         2  444.50ms  439.93ms  449.07ms  void k<int>(int*, unsigned long)
                   27.52%  337.47ms         1  337.47ms  337.47ms  337.47ms  [CUDA memcpy DtoH]
      API calls:   60.27%  1.88067s         1  1.88067s  1.88067s  1.88067s  cudaHostAlloc
                   28.49%  889.01ms         1  889.01ms  889.01ms  889.01ms  cudaDeviceSynchronize
                   10.82%  337.55ms         1  337.55ms  337.55ms  337.55ms  cudaMemcpy
                    0.17%  5.1520ms         1  5.1520ms  5.1520ms  5.1520ms  cudaMalloc
                    0.15%  4.6178ms         4  1.1544ms  594.35us  2.8265ms  cuDeviceTotalMem
                    0.09%  2.6876ms       404  6.6520us     327ns  286.07us  cuDeviceGetAttribute
                    0.01%  416.39us         4  104.10us  59.830us  232.21us  cuDeviceGetName
                    0.00%  151.42us         2  75.710us  13.663us  137.76us  cudaLaunchKernel
                    0.00%  21.172us         4  5.2930us  3.0730us  8.5010us  cuDeviceGetPCIBusId
                    0.00%  9.5270us         8  1.1900us     428ns  4.5250us  cuDeviceGet
                    0.00%  3.3090us         4     827ns     650ns  1.2230us  cuDeviceGetUuid
                    0.00%  3.1080us         3  1.0360us     485ns  1.7180us  cuDeviceGetCount
$
4GB/0.44s = 9GB/s (in-kernel writes to the pinned allocation)
4GB/0.34s = 11.75GB/s (cudaMemcpy DtoH; typical for PCIE Gen2 to pinned memory)
We can see that, contrary to my previous statement, transferring the data via in-kernel writes to a pinned allocation does seem to be slower (about 33% slower in my test case) than a bulk copy (cudaMemcpy DtoH to a pinned allocation). However, this isn't quite an apples-to-apples comparison, because in the cudaMemcpy case the kernel still has to write the 4GB of data to a device allocation first. The speed of that write depends on the GPU's device memory bandwidth, which of course varies by GPU, so 33% is probably an overestimate of the difference. But if your GPU has plenty of memory bandwidth, the estimate will be pretty close. (On my V100, writing 4GB to device memory takes only ~7ms, so the fair comparison would be roughly 337ms + 7ms ≈ 344ms for the device-write-plus-copy path versus 444ms for the in-kernel write to pinned memory.)
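For completeness, here is a sketch of what that apples-to-apples measurement might look like (my own code, not part of the original test; variable names and the event-based timing are illustrative, and error checking is omitted). It times the device-memory write plus the bulk DtoH copy, and compares that against the in-kernel write straight to the pinned allocation:

#include <cstdio>

template <typename T>
__global__ void k(T *d, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  size_t n = 1048576*1024;          // 1Gi int elements = 4GB
  int *h, *d;
  cudaHostAlloc(&h, sizeof(h[0])*n, cudaHostAllocDefault);
  cudaMalloc(&d, sizeof(d[0])*n);
  cudaEvent_t t0, t1, t2;
  cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

  // Path A: kernel writes to device memory, then bulk copy to the pinned buffer
  cudaEventRecord(t0);
  k<<<160, 1024>>>(d, n);
  cudaEventRecord(t1);
  cudaMemcpy(h, d, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
  cudaEventRecord(t2);
  cudaEventSynchronize(t2);
  float kern_ms, copy_ms;
  cudaEventElapsedTime(&kern_ms, t0, t1);
  cudaEventElapsedTime(&copy_ms, t1, t2);
  printf("device write: %.1f ms, DtoH copy: %.1f ms, total: %.1f ms\n",
         kern_ms, copy_ms, kern_ms + copy_ms);

  // Path B: kernel writes directly to the pinned buffer
  cudaEventRecord(t0);
  k<<<160, 1024>>>(h, n);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  float direct_ms;
  cudaEventElapsedTime(&direct_ms, t0, t1);
  printf("in-kernel write to pinned: %.1f ms\n", direct_ms);
}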