I am running host-to-device bandwidth tests for different data sizes, and have noticed higher bandwidth when the host memory is pinned rather than pageable. Below is my plot of bandwidth in MB/s versus transfer size in bytes. Note that for small amounts of data (<300K) pageable memory fares better than pinned...is this related to memory allocation by the OS? The bandwidth test program is from NVIDIA's SDK code samples (with slight modifications on my side), and I am testing a Tesla C2050 using CUDA 4.0. The OS is 64-bit Linux.
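For context, here is a stripped-down sketch of the kind of comparison I am timing. It is not the SDK bandwidthTest code itself; the buffer size and iteration count below are arbitrary placeholders.

    // Sketch of a pinned vs. pageable host-to-device bandwidth comparison.
    // Not the SDK bandwidthTest code; sizes and iteration counts are arbitrary.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t memSize   = 32 * 1024 * 1024;  // 32 MB, placeholder
        const int    iterations = 10;

        unsigned char *h_pageable = (unsigned char*)malloc(memSize);  // pageable
        unsigned char *h_pinned   = NULL;
        cudaMallocHost((void**)&h_pinned, memSize);                   // pinned

        unsigned char *d_buf = NULL;
        cudaMalloc((void**)&d_buf, memSize);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time pageable host -> device copies
        cudaEventRecord(start, 0);
        for (int i = 0; i < iterations; i++)
            cudaMemcpy(d_buf, h_pageable, memSize, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float msPageable = 0.0f;
        cudaEventElapsedTime(&msPageable, start, stop);

        // Time pinned host -> device copies
        cudaEventRecord(start, 0);
        for (int i = 0; i < iterations; i++)
            cudaMemcpy(d_buf, h_pinned, memSize, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float msPinned = 0.0f;
        cudaEventElapsedTime(&msPinned, start, stop);

        double mb = (double)memSize * iterations / (1024.0 * 1024.0);
        printf("pageable: %.1f MB/s\n", mb / (msPageable / 1000.0));
        printf("pinned:   %.1f MB/s\n", mb / (msPinned   / 1000.0));

        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        free(h_pageable);
        return 0;
    }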
2 Answers
The cudaMemcpy implementation has different code paths for different devices, source and destination locations, and data sizes, in order to try to achieve the best possible throughput.
The different rates you are seeing are probably due to the implementation switching as the array size changes.
For example, Fermi GPUs have both dedicated copy engines (which can run in parallel with kernels running on the SMs), and SMs which can access host memory over PCI-e. For smaller arrays, it may be more efficient for cudaMemcpy to be implemented as a kernel running on the SMs that reads host memory directly, and stores the loaded data in device memory (or vice versa), so the driver may choose to do this. Or it may be more efficient to use the copy engine -- I'm not sure which it does in practice, but I think switching between them is the cause of the crossover you see in your graph.
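To illustrate what "SMs which can access host memory over PCI-e" means, here is a rough sketch using mapped (zero-copy) pinned memory. I am not claiming this is how the driver implements cudaMemcpy internally; it just shows the mechanism the SMs have available.

    // Illustration only: a kernel that reads host memory over PCI-e via
    // mapped (zero-copy) pinned memory and stores it in device memory.
    // This is NOT necessarily how cudaMemcpy is implemented internally.
    #include <cuda_runtime.h>

    __global__ void copyFromHost(const float *src, float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];   // each load travels over PCI-e
    }

    int main()
    {
        const int n = 1 << 20;

        // Must be set before the context is created to allow mapped memory.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        float *h_src = NULL;
        // cudaHostAllocMapped makes the host buffer visible to the device.
        cudaHostAlloc((void**)&h_src, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; i++) h_src[i] = (float)i;

        float *d_src = NULL;
        cudaHostGetDevicePointer((void**)&d_src, h_src, 0);  // device-side alias

        float *d_dst = NULL;
        cudaMalloc((void**)&d_dst, n * sizeof(float));

        copyFromHost<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
        cudaDeviceSynchronize();

        cudaFree(d_dst);
        cudaFreeHost(h_src);
        return 0;
    }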

It is possible that the test is cheating.
Here is one of the timed code sections:
cutilSafeCall( cudaEventRecord( start, 0 ) );
if( PINNED == memMode )
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
                                        cudaMemcpyDeviceToHost, 0) );
    }
}
else
{
    for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
    {
        cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                   cudaMemcpyDeviceToHost) );
    }
}
cutilSafeCall( cudaEventRecord( stop, 0 ) );
Note that the test uses different functions to do the memcpy for the different kinds of memory. I think this is cheating, because the main difference between the modes should only be how the memory is allocated: with cudaHostAlloc for pinned and with malloc for pageable.
Different memcpy functions can have different error-checking and transfer-setup paths.
So try modifying the test to do the copy in both modes with cudaMemcpy(), e.g. by changing every if after cudaEventRecord(...) to if( 0 && (PINNED == memMode) ).
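A sketch of what that modification might look like, reusing the variable names from the test code quoted above (the condition is forced false so both modes go through the same synchronous cudaMemcpy path):

    // Proposed modification (sketch): force both memory modes through the
    // same synchronous cudaMemcpy path so that only the allocation differs.
    cutilSafeCall( cudaEventRecord( start, 0 ) );
    if( 0 && (PINNED == memMode) )   // disabled: never take the async path
    {
        for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
        {
            cutilSafeCall( cudaMemcpyAsync( h_odata, d_idata, memSize,
                                            cudaMemcpyDeviceToHost, 0) );
        }
    }
    else
    {
        for( unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++ )
        {
            cutilSafeCall( cudaMemcpy( h_odata, d_idata, memSize,
                                       cudaMemcpyDeviceToHost) );
        }
    }
    cutilSafeCall( cudaEventRecord( stop, 0 ) );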
- Thank you for explaining, but I am using `cudaMemcpy(...)` for both pinned and pageable allocations. – Sayan Aug 03 '11 at 23:33
- The code above is from the test, but what you left out is that after the last line it calls cudaDeviceSynchronize() before stopping the host timer. This ensures the entire copy cost is timed, and bandwidthTest does not "cheat". – harrism Aug 04 '11 at 00:58
- The cheating is not in the cudaDeviceSynchronize() (waiting for an async operation to end), but in using different copy functions (cudaMemcpyAsync may have a different setup time). – osgx Aug 04 '11 at 01:22
- Yes, I ran the 'cheat' test, using cudaMemcpy for pageable and cudaMemcpyAsync for pinned (with cudaDeviceSynchronize() before the timer is stopped), and the results are exactly the same. – Sayan Aug 04 '11 at 15:22