Pinned memory is supposed to increase transfer rates from host to device (see the API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I didn't think this would work, but it does:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    //allocate THREADS doubles, not THREADS bytes
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    //kernel launches are asynchronous: wait before reading the results
    cudaDeviceSynchronize();

    //read output
    printf("Data after kernel execution: ");
    for (size_t i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");

    cudaFreeHost(pinnedHostPtr);
}

Output:

Thread id: 0, memory content: 0.000000
Thread id: 1, memory content: 1.000000
Thread id: 2, memory content: 2.000000
Thread id: 3, memory content: 3.000000
Thread id: 4, memory content: 4.000000
Thread id: 5, memory content: 5.000000
Thread id: 6, memory content: 6.000000
Thread id: 7, memory content: 7.000000
Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000

My questions are:

  • Is pinned memory zero-copy? I thought only mapped pinned memory was zero-copy.
  • If it is zero-copy, why is there an explicit way to map it to the device (cudaHostAlloc with the cudaHostAllocMapped flag)?

I'm using CUDA Toolkit 5.5 and a Quadro 4000 with the driver set to TCC mode, compiling with the options sm_20,compute_20.

Budric

2 Answers

Congratulations! You're encountering a compute capability 2.x + TCC + 64-bit OS feature that newer CUDA versions enable :)

Read the rest to find out more!

First, a small theory summary, as CUDA taught us:

  • Pinned memory by itself is not zero-copy, since the GPU cannot access it (it is not mapped into the device's address space); it is used to transfer data from the host to the GPU efficiently. It is page-locked memory (a scarce kernel resource) and has some performance advantages over normal pageable memory (see the sketch after this list for the usual explicit-copy workflow).

  • Pinned zero-copy memory is page-locked memory (usually allocated with the cudaHostAllocMapped flag) that the GPU can also use directly, since it is mapped into the device's address space.
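
For contrast, here is a minimal sketch of the usual explicit-copy workflow the first bullet describes: a separate device buffer plus explicit transfers (someKernel and the buffer size are made up for illustration; error checking omitted):

// Classic pinned-memory workflow: page-locked host buffer + explicit copies.
double *hostPtr, *devPtr;
const size_t N = 8;
cudaHostAlloc((void **)&hostPtr, N * sizeof(double), cudaHostAllocDefault);
cudaMalloc((void **)&devPtr, N * sizeof(double));

// The pinned host buffer makes this transfer faster, but the GPU still
// operates on its own copy in device memory.
cudaMemcpy(devPtr, hostPtr, N * sizeof(double), cudaMemcpyHostToDevice);
someKernel<<<1, N>>>(devPtr);   // hypothetical kernel
cudaMemcpy(hostPtr, devPtr, N * sizeof(double), cudaMemcpyDeviceToHost);

cudaFree(devPtr);
cudaFreeHost(hostPtr);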

Why can you access memory allocated on the host from the device without explicitly mapping it?

Take a look at the release notes for CUDA 4.0 (and higher):

  • (Windows and Linux) Added support for unified virtual address space.

Devices supporting 64-bit and compute 2.0 and higher capability now share a single unified address space between the host and all devices. This means that the pointer used to access memory on the host is the same as the pointer used to access memory on the device. Therefore, the location of memory may be queried directly from its pointer value; the direction of a memory copy need not be specified.

To summarize: if your card has compute capability 2.0 or higher (and it does: https://developer.nvidia.com/cuda-gpus), you are running a 64-bit OS, and on Windows the device is in TCC mode, you are automatically using UVA (Unified Virtual Addressing) between the host and the device. That means your code automatically gets zero-copy-like accesses.
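
You can verify this at runtime with documented device properties; a minimal sketch for device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // unifiedAddressing is 1 when the device shares a single virtual
    // address space with the host (64-bit OS, compute capability >= 2.0,
    // and TCC mode on Windows).
    printf("UVA in effect: %s\n", prop.unifiedAddressing ? "yes" : "no");

    // canMapHostMemory reports whether explicitly mapped pinned memory
    // (cudaHostAllocMapped) is supported at all.
    printf("Can map host memory: %s\n", prop.canMapHostMemory ? "yes" : "no");
    return 0;
}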

This is also in the CUDA documentation for the current version, in the section "Automatic Mapping of Host Allocated Host Memory".

Marco A.

Mapped memory is a type of pinned memory: it is created when you pin memory and additionally pass the cudaHostAllocMapped flag. However, even though you specified cudaHostAllocDefault, the memory is also "mapped" under certain circumstances. I believe TCC mode combined with a 64-bit OS is sufficient to meet the conditions required for this "auto-mapped" feature.

The central issue is whether UVA is in effect. In your case, it is.

Regarding your question about why the explicit capability exists: it is for use in cases where UVA is not in effect (for example, on a 32-bit host OS); a sketch of that explicit path follows.
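
Here is a minimal sketch of the explicit mapped-memory path you would use on a non-UVA system (someKernel is hypothetical; error checking omitted):

#include <cuda_runtime.h>

// Explicit zero-copy without UVA:
// 1. enable host-memory mapping before the CUDA context is created,
// 2. allocate with cudaHostAllocMapped,
// 3. fetch the device-side alias of the host pointer.
double *hostPtr, *devPtr;
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc((void **)&hostPtr, 8 * sizeof(double), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

// The kernel must use devPtr: without UVA the host and device
// pointer values generally differ.
someKernel<<<1, 8>>>(devPtr);   // hypothetical kernel
cudaDeviceSynchronize();

cudaFreeHost(hostPtr);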

From the documentation (when UVA is in effect):

Automatic Mapping of Host Allocated Host Memory

All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified. The pointer value through which allocated host memory may be accessed in kernels on all devices that support unified addressing is the same as the pointer value through which that memory is accessed on the host. It is not necessary to call cudaHostGetDevicePointer() to get the device pointer for these allocations.
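
In other words, under UVA the device-side pointer is simply the host pointer. A small sketch of what the quoted paragraph implies (the pointer equality is the documented behavior on UVA platforms):

#include <cassert>
#include <cuda_runtime.h>

int main()
{
    double *hostPtr = 0, *devPtr = 0;
    cudaHostAlloc((void **)&hostPtr, 8 * sizeof(double), cudaHostAllocDefault);

    // Calling this is unnecessary under UVA, but if you do, the returned
    // device pointer is expected to equal the host pointer.
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);
    assert(devPtr == hostPtr);  // expected on UVA systems, per the docs

    cudaFreeHost(hostPtr);
    return 0;
}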

Robert Crovella