There are numerous questions here on the cuda SO tag about how to use pinned memory for "zero-copy" operations. Here is one example; you can find many more.
If you only have to write to each output point once, and your writes are/would be nicely coalesced, then there should not be a major performance difference between the costs of:
- writing to device memory and then cudaMemcpy D->H after the kernel
- writing directly to host-pinned memory
In the second case you will still need a cudaDeviceSynchronize() after the kernel call, before accessing the data on the host, to ensure consistency (the cudaMemcpy in the first case synchronizes implicitly).
Differences on the order of ~10 microseconds are still possible due to CUDA operation overheads.
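For illustration, here is a minimal sketch of the two patterns described above (the kernel, names, and sizes are mine, not from the question; error checking omitted):

__global__ void produce(int *out, size_t n){
  // grid-stride loop, one coalesced write per output point
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = (int)i;
}

int main(){
  const size_t n = 1048576;
  // Option 1: write to a device allocation, then cudaMemcpy D->H
  int *d_out, *h_out = new int[n];
  cudaMalloc(&d_out, sizeof(d_out[0])*n);
  produce<<<160, 1024>>>(d_out, n);
  cudaMemcpy(h_out, d_out, sizeof(d_out[0])*n, cudaMemcpyDeviceToHost); // blocks until the copy completes
  // Option 2: write directly to host-pinned memory from the kernel
  int *h_pinned;
  cudaHostAlloc(&h_pinned, sizeof(h_pinned[0])*n, cudaHostAllocDefault);
  produce<<<160, 1024>>>(h_pinned, n);
  cudaDeviceSynchronize(); // required before the host reads h_pinned
}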
It should be possible to demonstrate that bulk transfer of data using direct reads/writes to pinned memory from kernel code will achieve approximately the same bandwidth as a cudaMemcpy transfer.
As an aside, the "paging semantics" of unified memory can be worked around (for example, by prefetching), but again, well-optimized code in any of these three scenarios is not likely to show marked performance or duration differences.
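To illustrate the prefetching workaround (a sketch of my own, not from the question; it assumes a GPU/OS combination that supports concurrentManagedAccess, e.g. Pascal or newer on Linux), the managed pages are moved to the processor that will touch them next, so neither the kernel nor the host pays demand-paging costs:

__global__ void produce(int *out, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = (int)i;
}

int main(){
  const size_t n = 1048576;
  int *m;
  cudaMallocManaged(&m, sizeof(m[0])*n);
  cudaMemPrefetchAsync(m, sizeof(m[0])*n, 0);               // move pages to GPU 0 before the kernel runs
  produce<<<160, 1024>>>(m, n);
  cudaMemPrefetchAsync(m, sizeof(m[0])*n, cudaCpuDeviceId); // move the results back to host memory
  cudaDeviceSynchronize();                                  // data is now resident on the host
}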
Responding to comments: my use of "approximately" above is probably a stretch. Here's a kernel that writes 4GB of data in less than half a second on a PCIE Gen2 system:
$ cat t2138.cu
template <typename T>
__global__ void k(T *d, size_t n){
  // grid-stride loop: write every element of the n-element output
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  int *d;
  size_t n = 1048576*1024;          // 4GB of int data
  // pinned host allocation, written directly from the kernel
  cudaHostAlloc(&d, sizeof(d[0])*n, cudaHostAllocDefault);
  k<<<160, 1024>>>(d, n);           // launched twice; both calls appear in the profile below
  k<<<160, 1024>>>(d, n);
  cudaDeviceSynchronize();
  // for comparison: a bulk cudaMemcpy DtoH from a device allocation into the same pinned buffer
  int *d1;
  cudaMalloc(&d1, sizeof(d[0])*n);
  cudaMemcpy(d, d1, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
}
$ nvcc -o t2138 t2138.cu
$ compute-sanitizer ./t2138
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2138
==21201== NVPROF is profiling process 21201, command: ./t2138
==21201== Profiling application: ./t2138
==21201== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   72.48%  889.00ms         2  444.50ms  439.93ms  449.07ms  void k<int>(int*, unsigned long)
                   27.52%  337.47ms         1  337.47ms  337.47ms  337.47ms  [CUDA memcpy DtoH]
      API calls:   60.27%  1.88067s         1  1.88067s  1.88067s  1.88067s  cudaHostAlloc
                   28.49%  889.01ms         1  889.01ms  889.01ms  889.01ms  cudaDeviceSynchronize
                   10.82%  337.55ms         1  337.55ms  337.55ms  337.55ms  cudaMemcpy
                    0.17%  5.1520ms         1  5.1520ms  5.1520ms  5.1520ms  cudaMalloc
                    0.15%  4.6178ms         4  1.1544ms  594.35us  2.8265ms  cuDeviceTotalMem
                    0.09%  2.6876ms       404  6.6520us     327ns  286.07us  cuDeviceGetAttribute
                    0.01%  416.39us         4  104.10us  59.830us  232.21us  cuDeviceGetName
                    0.00%  151.42us         2  75.710us  13.663us  137.76us  cudaLaunchKernel
                    0.00%  21.172us         4  5.2930us  3.0730us  8.5010us  cuDeviceGetPCIBusId
                    0.00%  9.5270us         8  1.1900us     428ns  4.5250us  cuDeviceGet
                    0.00%  3.3090us         4     827ns     650ns  1.2230us  cuDeviceGetUuid
                    0.00%  3.1080us         3  1.0360us     485ns  1.7180us  cuDeviceGetCount
$
4GB/0.44s = 9GB/s (in-kernel writes to the pinned allocation)
4GB/0.34s = 11.75GB/s (cudaMemcpy DtoH; typical for PCIE Gen2 to pinned memory)
We can see that, contrary to my previous statement, transferring the data via in-kernel writes to a pinned allocation does seem to be slower (about 33% slower in my test case) than a bulk copy (cudaMemcpy DtoH to a pinned allocation). However, this isn't quite an apples-to-apples comparison, because in the cudaMemcpy case the kernel still has to write the 4GB of data to a device allocation first. The speed of that write depends on the GPU's device memory bandwidth, which of course varies by GPU, so 33% is probably an overestimate of the difference. But if your GPU has plenty of memory bandwidth, the estimate will be pretty close. (On my V100, writing 4GB to device memory takes only ~7ms, so the fair comparison would be roughly 337ms + 7ms ≈ 344ms for the device-write-plus-copy path versus 444ms for the in-kernel write to pinned memory.)
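For completeness, here is a sketch of what that apples-to-apples measurement might look like (my own code, not part of the original test; variable names and the event-based timing are illustrative, and error checking is omitted). It times the device-memory write plus the bulk DtoH copy, and compares that against the in-kernel write straight to the pinned allocation:

#include <cstdio>

template <typename T>
__global__ void k(T *d, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  size_t n = 1048576*1024;          // 1Gi int elements = 4GB
  int *h, *d;
  cudaHostAlloc(&h, sizeof(h[0])*n, cudaHostAllocDefault);
  cudaMalloc(&d, sizeof(d[0])*n);
  cudaEvent_t t0, t1, t2;
  cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

  // Path A: kernel writes to device memory, then bulk copy to the pinned buffer
  cudaEventRecord(t0);
  k<<<160, 1024>>>(d, n);
  cudaEventRecord(t1);
  cudaMemcpy(h, d, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
  cudaEventRecord(t2);
  cudaEventSynchronize(t2);
  float kern_ms, copy_ms;
  cudaEventElapsedTime(&kern_ms, t0, t1);
  cudaEventElapsedTime(&copy_ms, t1, t2);
  printf("device write: %.1f ms, DtoH copy: %.1f ms, total: %.1f ms\n",
         kern_ms, copy_ms, kern_ms + copy_ms);

  // Path B: kernel writes directly to the pinned buffer
  cudaEventRecord(t0);
  k<<<160, 1024>>>(h, n);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  float direct_ms;
  cudaEventElapsedTime(&direct_ms, t0, t1);
  printf("in-kernel write to pinned: %.1f ms\n", direct_ms);
}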