
In CUDA we can use pinned memory to copy data from the host to the GPU more efficiently than from the default pageable memory allocated via malloc on the host. However, there are two types of pinned memory: default pinned memory and zero-copy pinned memory.

Default pinned memory transfers data from host to GPU about twice as fast as normal (pageable) transfers, so there's definitely an advantage (provided we have enough host memory to page-lock).

With the other variant, zero-copy pinned memory, we don't need to copy the data from the host to the GPU's DRAM at all. The kernels read the data directly from host memory.
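For concreteness, here is a minimal sketch of the two allocation styles as I understand them (buffer names and sizes are mine, purely for illustration):

```
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;

    // Must be set before the CUDA context is created if we want mapping.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Default pinned (page-locked) memory: still needs an explicit
    // cudaMemcpy, but the DMA transfer is faster than from pageable memory.
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, N * sizeof(float));

    // Zero-copy (mapped) pinned memory: kernels dereference a device
    // pointer that aliases the host allocation, so no explicit copy at all.
    float *h_mapped, *d_mapped;
    cudaHostAlloc((void**)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_mapped, h_mapped, 0);

    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    return 0;
}
```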

My question is: which of these pinned-memory types is the better programming practice?

jwdmsd

2 Answers


I think it depends on your application (otherwise, why would they provide both ways?)

Mapped, pinned memory (zero-copy) is useful when:

  • The GPU has no memory of its own and uses system RAM anyway

  • You load the data exactly once, but you have a lot of computation to perform on it and you want to hide the memory-transfer latency behind that computation (see the sketch after this list).

  • The host side wants to change/add more data, or read the results, while the kernel is still running (e.g. for communication)

  • The data does not fit into GPU memory
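As a concrete illustration of the load-once case, here is a minimal zero-copy sketch (the kernel, names, and sizes are mine; error checking omitted):

```
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: reads each input element exactly once over the bus
// and writes the result straight back to mapped host memory.
__global__ void scale(const float *in, float *out, float k, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];
}

int main() {
    const size_t n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);        // enable host-memory mapping

    float *h_in, *h_out;
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    for (size_t i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;                          // device aliases of the host buffers
    cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(d_in, d_out, 2.0f, n);
    cudaDeviceSynchronize();   // no cudaMemcpy: h_out already holds the results
    printf("h_out[0] = %f\n", h_out[0]);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```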

Note that you can also use multiple streams to copy data and run kernels in parallel (the second sketch below shows a stream used this way).

Pinned, but not mapped, memory is better when:

  • You load or store the data multiple times. For example: you have multiple subsequent kernels performing the work in steps, so there is no need to load the data from the host every time (as in the sketch after this list).

  • There is not that much computation to perform, and the loading latencies would not be hidden well.
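And a matching sketch for the pinned-but-not-mapped case, which also shows the stream-based asynchronous copy mentioned above (again, the kernel, names, and sizes are mine):

```
#include <cuda_runtime.h>

// Illustrative kernel: one step of a multi-step computation that keeps
// reusing the same device-resident data.
__global__ void step(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);        // pinned, but NOT mapped
    cudaMalloc((void**)&d_buf, bytes);
    for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Upload once; the pinned buffer allows a true asynchronous DMA transfer.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    // Subsequent kernels reuse the data already sitting in device DRAM,
    // so nothing is re-fetched from the host between steps.
    for (int s = 0; s < 4; ++s)
        step<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(d_buf, n);

    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```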

CygnusX1
  • Yes, exactly. I found almost the same description in the book 'CUDA by Example'. They claim that mapped memory is best when a) your kernels read and write the data exactly once, and b) you have integrated graphics, like the ION platform, where CPU and GPU share the same memory. – jwdmsd Mar 06 '11 at 13:15

Mapped pinned memory is identical to other types of pinned memory in all respects, except that it is mapped into the CUDA address space, so it can be read and written by CUDA kernels as well as used for DMA transfers by the Copy Engines.

The advantage to not mapping pinned memory was twofold: it saved you some address space, which can be a precious commodity in a world of 32-bit platforms with GPUs that can hold 3-4 GB of RAM. Also, memory that is not mapped cannot be accidentally corrupted by rogue kernels. But that concern is esoteric enough that the unified address space feature in CUDA 4.0 will cause all pinned allocations to be mapped by default.
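If you want to see which of these applies on a given system, both capabilities are exposed through the device properties; a small sketch (device 0 picked for illustration):

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0, for illustration

    // canMapHostMemory: the device can address mapped pinned allocations.
    // unifiedAddressing: UVA is active, so (as of CUDA 4.0) pinned
    // allocations are also mapped by default.
    printf("canMapHostMemory:  %d\n", prop.canMapHostMemory);
    printf("unifiedAddressing: %d\n", prop.unifiedAddressing);
    return 0;
}
```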

Besides the points raised by the Sanders/Kandrot book, other things to keep in mind:

  • writing to host memory from a kernel (e.g. to post results to the CPU) is nice in that the GPU does not have any latency to cover in that case, and

  • it is VERY IMPORTANT that the memory operations be coalesced - otherwise, even SM 2.x and later GPUs take a big bandwidth hit (see the sketch below).
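To make the coalescing point concrete, here is a hypothetical pair of kernels posting results to mapped host memory; they differ only in access pattern (the names and the stride are mine):

```
#include <cuda_runtime.h>
#include <cstdio>

// Coalesced: consecutive threads write consecutive addresses, so each
// warp's stores combine into a few wide bus transactions.
__global__ void post_coalesced(float *host_out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        host_out[i] = (float)i;       // fire-and-forget write over the bus
}

// Strided: each warp's stores scatter across memory, multiplying the
// number of bus transactions and crippling effective bandwidth.
__global__ void post_strided(float *host_out, size_t n, size_t stride) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        host_out[(i * stride) % n] = (float)i;
}

int main() {
    const size_t n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    float *h_out, *d_out;
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    post_coalesced<<<(unsigned)((n + 255) / 256), 256>>>(d_out, n);
    post_strided <<<(unsigned)((n + 255) / 256), 256>>>(d_out, n, 33);
    cudaDeviceSynchronize();
    printf("h_out[1] = %f\n", h_out[1]);
    cudaFreeHost(h_out);
    return 0;
}
```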

ArchaeaSoftware
  • Could you please expand on the last two points? Concerning the first, what do you mean when you say that "the GPU does not have any latency to cover in that case"? Regarding the second, why do zero-copy operations need coalescing? Do they go through global memory anyway? – Vitality May 11 '13 at 20:19
  • If the GPU reads from mapped pinned memory, it has to find something to do until the memory request arrives. If it writes to mapped pinned memory, it posts a write to the bus and moves on. I don't know why they have to be coalesced. Coalescing is a warp-based construct and it must have something to do with the hardware implementation. – ArchaeaSoftware May 13 '13 at 18:55
  • Coalescing decreases the number of memory operations, so it becomes even more important for zero-copy memory, which is accessed through the slow PCI-E bus (compared to the GPU's own high-bandwidth global memory). – Bulat Feb 12 '16 at 17:23
  • The slowness of PCIe as compared to local device memory is exactly why it's a bit surprising that the hardware would care whether the transactions are coalesced. One would think that the L2, which is designed to service traffic to device memory with 10x higher bandwidth, could translate any number of uncoalesced requests into the optimal number of PCIe bus transactions. – ArchaeaSoftware Feb 15 '16 at 20:06
  • @ArchaeaSoftware: If I am not mistaken, at the moment (2016...) CUDA does not default to making pinned allocations also mapped. Did this change after CUDA 4.0? I suspect that perhaps you misspoke... it's up to the user to decide whether s/he wants mapping or not. – einpoklum Mar 14 '16 at 09:28
  • When did I say that pinned allocations are also mapped by default? We are talking about how the hardware handles mapped pinned allocations (an opt-in that was added in CUDA 2.2), which enable CUDA kernels to directly access host memory. As of CUDA 4.0, all pinned allocations are, indeed, also mapped on systems that support unified virtual addressing (UVA). You can call cudaGetDeviceProperties() and check cudaDeviceProp::unifiedAddressing to see whether this is happening. – ArchaeaSoftware Mar 15 '16 at 18:02