
I am learning the NVIDIA NVENC API. The SDK supplies a sample called "NvEncoderCudaInterop". There is a chunk of code which copies YUV plane arrays from CPU to GPU buffers. This is the code:

    // copy luma
    CUDA_MEMCPY2D copyParam;
    memset(&copyParam, 0, sizeof(copyParam));
    copyParam.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    copyParam.dstDevice = pEncodeBuffer->stInputBfr.pNV12devPtr;
    copyParam.dstPitch = pEncodeBuffer->stInputBfr.uNV12Stride;
    copyParam.srcMemoryType = CU_MEMORYTYPE_HOST;
    copyParam.srcHost = yuv[0];
    copyParam.srcPitch = width;
    copyParam.WidthInBytes = width;
    copyParam.Height = height;
    __cu(cuMemcpy2D(&copyParam));

    // copy chroma
    __cu(cuMemcpyHtoD(m_ChromaDevPtr[0], yuv[1], width*height / 4));
    __cu(cuMemcpyHtoD(m_ChromaDevPtr[1], yuv[2], width*height / 4));

I understand the rationale behind the procedure: the memory is copied to the GPU for the kernel to process it. What I don't understand is why cuMemcpy2D is used to copy the Y plane, while cuMemcpyHtoD is used for U and V. Why can't Y be copied using cuMemcpyHtoD as well? As far as I understand, the YUV planes have the same linear memory layout; the only difference is their size.

PS: I asked this question originally on the Computer Graphics site but got no answers.

Michael IV

2 Answers


On the host, the YUV buffer data is (assumed to be) stored as unpitched YUV 4:2:0 data in separate planes. That means Y has its own plane (yuv[0]), followed by U (yuv[1]), followed by V (yuv[2]).
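For illustration only, here is a minimal sketch of how those three unpitched host plane pointers might be derived, assuming the frame arrives as one contiguous I420-style buffer (the helper and its names are hypothetical, not part of the sample):

    #include <cstddef>

    // Hypothetical helper (not from the sample): given one contiguous, unpitched
    // I420 frame on the host (Y plane, then U, then V), compute the three plane
    // pointers that play the role of yuv[0], yuv[1], yuv[2].
    static void GetI420Planes(unsigned char *frame, int width, int height,
                              unsigned char *yuv[3])
    {
        yuv[0] = frame;                                        // Y: width   x height
        yuv[1] = yuv[0] + (size_t)width * height;              // U: width/2 x height/2
        yuv[2] = yuv[1] + (size_t)(width / 2) * (height / 2);  // V: width/2 x height/2
    }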

The intended storage target on the device is an NV12 buffer, NV_ENC_BUFFER_FORMAT_NV12_PL, which the documentation (NvEncodeAPI_v.5.0.pdf, p. 12) defines as:

NV_ENC_BUFFER_FORMAT_NV12_PL Semi-Planar YUV [UV interleaved] allocated as serial 2D buffer.

Note that this is intended to be:

  1. Pitched storage (this is evident because the main buffer pointer pEncodeBuffer->stInputBfr.pNV12devPtr has been previously allocated in that file with cuMemAllocPitch; a rough allocation sketch follows this list)
  2. "Semi-Planar" storage. The (unpitched) planar storage on the host implies Y followed by U followed by V. The "semi-planar" storage on the device implies Y plane followed by a special plane that has U and V interleaved:

    U0V0  U1V1  U2V2 ...
    

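As a rough sketch (an illustration, not the sample's exact code), such a pitched NV12 buffer could be allocated with the driver API along these lines; the returned pitch is what the sample carries around as pEncodeBuffer->stInputBfr.uNV12Stride:

    #include <cuda.h>

    // Hedged sketch: allocate a pitched NV12 buffer holding 'height' rows of Y
    // followed by 'height/2' rows of interleaved UV, each row 'width' bytes wide.
    static CUresult AllocNV12(int width, int height,
                              CUdeviceptr *pDevNV12, size_t *pPitch)
    {
        return cuMemAllocPitch(pDevNV12,
                               pPitch,                  // returned row stride in bytes
                               (size_t)width,           // requested row width in bytes
                               (size_t)height * 3 / 2,  // Y rows + UV rows
                               16);                     // element size hint for alignment
    }
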
So it's easy enough to copy the Y data down with a single 2D memcpy call. But the UV plane requires some assembly from separate buffers. The writers of this code chose to do the assembly as follows:

  1. Copy the U and V planes to the device, independently, to independent, unpitched buffers. That is the code you have shown, and the independent buffers on the device are m_ChromaDevPtr[0] and m_ChromaDevPtr[1] respectively (U then V, separate, unpitched).

  2. Assemble the UV interleaved, pitched plane on the device using a CUDA kernel. This makes sense because there is a fair amount of data movement, and the device, having higher memory bandwidth, can do this more efficiently than the host can. Also note that a single 2D memcpy call could not handle this case, because we effectively have 2 strides. One is the (short) stride from element to element, for example the short stride from U0 to U1 in the example above. The other is the "longer" stride at the end of each line, the "normal" stride associated with the pitched allocation.

The kernel that accomplishes the "assembly" of the UV interleaved, pitched plane on the device from the non-interleaved, unpitched m_ChromaDevPtr[0] and m_ChromaDevPtr[1] buffers is called m_cuInterleaveUVFunction, and it is launched here (right after the code you have shown, and starting with the tail end of the code you have shown):

    __cu(cuMemcpyHtoD(m_ChromaDevPtr[0], yuv[1], width*height / 4));
    __cu(cuMemcpyHtoD(m_ChromaDevPtr[1], yuv[2], width*height / 4));

#define BLOCK_X 32
#define BLOCK_Y 16
    int chromaHeight = height / 2;
    int chromaWidth = width / 2;
    dim3 block(BLOCK_X, BLOCK_Y, 1);
    dim3 grid((chromaWidth + BLOCK_X - 1) / BLOCK_X, (chromaHeight + BLOCK_Y - 1) / BLOCK_Y, 1);
#undef BLOCK_Y
#undef BLOCK_X

    // dNV12Chroma points to the start of the interleaved UV plane, which begins
    // right after the pitched Y plane (uNV12Stride * height bytes into the buffer)
    CUdeviceptr dNV12Chroma = (CUdeviceptr)((unsigned char*)pEncodeBuffer->stInputBfr.pNV12devPtr + pEncodeBuffer->stInputBfr.uNV12Stride*height);
    // Kernel arguments: U and V source pointers, UV destination pointer, chroma
    // dimensions, the two (unpitched) source pitches, and the pitched destination stride
    void *args[8] = { &m_ChromaDevPtr[0], &m_ChromaDevPtr[1], &dNV12Chroma, &chromaWidth, &chromaHeight, &chromaWidth, &chromaWidth, &pEncodeBuffer->stInputBfr.uNV12Stride};

    __cu(cuLaunchKernel(m_cuInterleaveUVFunction, grid.x, grid.y, grid.z,
        block.x, block.y, block.z,
        0,
        NULL, args, NULL));
    CUresult cuResult = cuStreamQuery(NULL);
    if (!((cuResult == CUDA_SUCCESS) || (cuResult == CUDA_ERROR_NOT_READY)))
    {
        return NV_ENC_ERR_GENERIC;
    }
    return NV_ENC_SUCCESS;
}

Note that some of the arguments being passed to this "UV Assembly" kernel are:

  1. The pointers to the separate U and V buffers on the device (e.g. &m_ChromaDevPtr[0] etc.)
  2. The pointer to the starting location in the main buffer where the UV interleaved plane will be (&dNV12Chroma)
  3. A pointer to the pitch of the target buffer (&pEncodeBuffer->stInputBfr.uNV12Stride)

just as you would expect if you were going to write your own kernel to do that assembly. If you want to see what's actually in the assembly kernel, it is in the preproc.cu file in that sample project.
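If it helps, here is a minimal sketch of what such an interleave kernel could look like, written to match the argument list above; it is a hypothetical reconstruction, not the actual contents of preproc.cu:

    // Hypothetical UV-interleave kernel matching the cuLaunchKernel argument list
    // above (see preproc.cu for the sample's real implementation).
    extern "C" __global__ void InterleaveUV(unsigned char *srcU, unsigned char *srcV,
                                            unsigned char *dstUV,
                                            int chromaWidth, int chromaHeight,
                                            int srcUPitch, int srcVPitch, int dstPitch)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // chroma column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // chroma row
        if (x >= chromaWidth || y >= chromaHeight)
            return;

        // Read one U and one V sample from the separate, unpitched source buffers
        unsigned char u = srcU[y * srcUPitch + x];
        unsigned char v = srcV[y * srcVPitch + x];

        // Each UV pair occupies 2 bytes of the pitched destination row: U0 V0 U1 V1 ...
        dstUV[y * dstPitch + 2 * x]     = u;
        dstUV[y * dstPitch + 2 * x + 1] = v;
    }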

EDIT: Responding to the question in the comments. On the host, the Y data is stored like this (let's pretend each line has only 4 elements; that is not realistic for YUV 4:2:0 data, but the focus here is on the copying operation, not the line length):

Y0  Y1  Y2  Y3
Y4  Y5  Y6  Y7
....

On the device, that buffer is organized as follows:

Y0  Y1  Y2  Y3  X  X  X  X
Y4  Y5  Y6  Y7  X  X  X  X
...

where the X values are padding to make each line equal the pitch. To copy from the host buffer above to the device buffer above, we must use a pitched copy, i.e. cuMemcpy2D.

On the host, the U data is organized as follows:

U0  U1  U2  U3
U4  U5  U6  U7
....

and the V data is organized similarly:

V0  V1  V2  V3
V4  V5  V6  V7
....

On the device, both the above U and V data will eventually be combined into a single UV plane that is also pitched like so:

U0V0  U1V1  U2V2  U3V3  X  X  X  X
U4V4  U5V5  U6V6  U7V7  X  X  X  X
...

There is no single memcpy operation that can properly grab the data from the unpitched host U-only and V-only buffers, and deposit that data according to the above pattern. It requires assembly of the U and V buffers together, and then depositing that data in the pitched destination buffer. This is handled first by copying the U and V data to separate device buffers that are organized exactly the same way as on the host:

U0  U1  U2  U3
U4  U5  U6  U7
....

This type of copy is handled with the ordinary, unpitched cuMemcpyHtoD.
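As a sketch of that first step (hedged: the helper name and error handling are illustrative, not the sample's code), the two temporary, unpitched device buffers and the plain host-to-device copies could look like this:

    #include <cuda.h>

    // Hedged sketch: allocate two unpitched device buffers for U and V and fill
    // them with ordinary (unpitched) host-to-device copies, one per chroma plane.
    static CUresult CopyChromaPlanes(unsigned char *yuv[3], int width, int height,
                                     CUdeviceptr chromaDevPtr[2])
    {
        size_t chromaSize = (size_t)(width / 2) * (height / 2);  // bytes per chroma plane
        CUresult res;

        if ((res = cuMemAlloc(&chromaDevPtr[0], chromaSize)) != CUDA_SUCCESS) return res;  // U buffer
        if ((res = cuMemAlloc(&chromaDevPtr[1], chromaSize)) != CUDA_SUCCESS) return res;  // V buffer

        if ((res = cuMemcpyHtoD(chromaDevPtr[0], yuv[1], chromaSize)) != CUDA_SUCCESS) return res;  // copy U
        return cuMemcpyHtoD(chromaDevPtr[1], yuv[2], chromaSize);                                   // copy V
    }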

Here's a diagram of the operations:

[Diagram: NVENC YUV Host To Device Copy Operation]

Notes:

  1. The copy of the Y-data cannot be done with an ordinary cuMemcpyHtoD, because the destination data is pitched.
  2. The copy of the U and V data is from an unpitched buffer to an unpitched buffer, so it can use cuMemcpyHtoD.
  3. The host-to-device copy of the U and V data cannot go directly to the NV12 buffer, because there is no cuMemcpy operation (2D or otherwise) that can handle that specific destination storage pattern.
Robert Crovella
  • Thanks for your detailed answer. I was able to figure out most of this stuff on my own, but I am not sure I understood your explanation of why cuMemcpy2D is used for Y and not cuMemcpyHtoD. Let's look at it the other way around: why not copy the U and V blocks from the host using cuMemcpy2D? Based on how it sits in memory, I can't see why it isn't copied the same way as the Y block. That is, on the host it is a linear memory buffer laid out as |---Y---|---U---|---V---| per frame. So unless the pitch value is what we are after, I don't see why to use it. – Michael IV Oct 08 '15 at 16:15
    I think I covered this in my answer. *All* of the host buffers are unpitched. The final destination buffer is supposed to be a pitched buffer consisting of 2 planes. The Y plane: Y0 Y1 Y2... and the UV plane: U0V0 U1V1 U2V2 ... The Y data can be directly copied from yuv[0] to the Y plane, but since the device Y plane is a **pitched** buffer we must use a pitched copy, which is why the memcpy2D is used. The UV data has to follow another path. Since the first step in that other path is the copy from an unpitched buffer **to an unpitched buffer**, the non-pitched copy types are used. – Robert Crovella Oct 08 '15 at 16:23

Just wanted to add that this NVENC sample was written this way to demonstrate the various ways an application might write into the padded GPU input buffer. One can use cuMemcpy2D for a host-to-device copy with padding, or cuMemcpyHtoD for plain transfers, or use a CUDA kernel to write data into the buffer. That is what the sample demonstrates.

However, it is not necessarily the most efficient. The above strategy requires 2x temporary GPU buffers, 1x final padded GPU buffer, 3x CUDA mem copies, and 1x CUDA launch.

A more efficient strategy would be this: assuming the host planar buffers are contiguous (memory order: Y plane, then U plane, then V plane), allocate 1x temporary GPU buffer of exactly the same total size (host Y+U+V), plus the required 1x final padded GPU buffer. Then perform 1x host-to-device copy (cuMemcpyHtoD) to transfer the host frame into the temporary GPU buffer, and 1x CUDA launch to write all portions of the padded buffer in one kernel. This moves the same amount of data, but takes only 1x HtoD copy and 1x CUDA launch, reducing launch overhead considerably.

Also, if you are doing this with multiple frames, you can use cuMemcpyHtoDAsync and overlap the copies of one frame with the kernels of the last. And, be sure you allocate the source host frame using cuMemAllocHost, to get the additional performance gain of pinned memory transfers.
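Here is a rough sketch of the kernel side of that strategy, under the stated assumptions (contiguous Y/U/V host frame copied into a single staging buffer); the kernel name and exact indexing are hypothetical, not from the SDK:

    // Hypothetical kernel: read one contiguous I420 staging buffer (Y, then U, then V)
    // and write both planes of the pitched NV12 buffer in a single launch.
    extern "C" __global__ void WriteNV12FromI420(const unsigned char *src,
                                                 int width, int height,
                                                 unsigned char *dstNV12, int dstPitch)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height)
            return;

        // Luma: copy straight into the pitched Y plane
        dstNV12[y * dstPitch + x] = src[y * width + x];

        // One thread per 2x2 luma block also writes its interleaved UV pair
        if ((x & 1) == 0 && (y & 1) == 0) {
            const unsigned char *srcU = src + width * height;
            const unsigned char *srcV = srcU + (width / 2) * (height / 2);
            unsigned char *dstUV = dstNV12 + (size_t)dstPitch * height;  // UV plane follows Y
            int cx = x >> 1, cy = y >> 1;
            dstUV[cy * dstPitch + 2 * cx]     = srcU[cy * (width / 2) + cx];
            dstUV[cy * dstPitch + 2 * cx + 1] = srcV[cy * (width / 2) + cx];
        }
    }

The host side then reduces to one pinned frame allocated with cuMemAllocHost (width*height*3/2 bytes), one cuMemcpyHtoDAsync of that frame into the staging buffer, and one cuLaunchKernel over a width x height grid, as described above.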

PS. I work for NVIDIA

rama