
I am writing a CUDA kernel to copy one array to another; both are in GPU memory. I don't want to use cudaMemcpyDeviceToDevice because of its poor performance.

The naive kernel:

__global__ void GpuCopy(float* des, const float* __restrict__ sour, const int M, const int N)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tx < N * M)
        des[tx] = sour[tx];
}
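
For reference, a launch for this kernel might look like the following; the grid sizing and the block size of 256 are illustrative choices, not part of the original question:

int total = M * N;
int threads = 256;                              // any multiple of the 32-thread warp size
int blocks = (total + threads - 1) / threads;   // round up so every element is covered
GpuCopy<<<blocks, threads>>>(des, sour, M, N);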

I think the naive kernel will not get high performance, so I tried to use __shared__ memory, but it does not look right:

__global__ void GpuCopyShared(float* des, const float* __restrict__ sour, const int M, const int N)
{
    __shared__ float TILE[tile];   // tile: a compile-time constant equal to the block size
    int tid = threadIdx.x;
    for (int i = 0; i < M * N / tile; i++)
    {
        TILE[tid] = sour[i * tile + tid];
        des[i * tile + tid] = TILE[tid];
    }
}

The former snippet copies global memory directly to des[], while the latter stages the data through __shared__ memory before writing it to des[]. I suspect the latter is slower than the former.

So, how should a __shared__ memory copy kernel be written? Another question: if I want to use __constant__ memory and the array (which is already on the GPU) is larger than constant memory, how can I copy it to another GPU array through __constant__ memory?

– Zziggurats

3 Answers

For ordinary linear-to-linear memory copying, shared memory won't give you any benefit. Your naive kernel should be fine. There may be some small optimizations, such as running with a smaller number of threadblocks, but tuning this will depend on your specific GPU to some degree; a sketch of that idea follows.
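
As an illustration of the "fewer threadblocks" idea (the kernel name and launch numbers here are mine, not from the answer), a grid-stride loop lets each thread copy several elements, so the grid can be sized to just fill the device. The third answer below uses this same pattern:

__global__ void gridStrideCopy(float* __restrict__ dst, const float* __restrict__ src, int n)
{
    // Each thread strides through the array, so a small grid still covers all elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        dst[i] = src[i];
}

// e.g. gridStrideCopy<<<8 * numberOfSMs, 256>>>(dst, src, n);  // tune per GPU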

Shared memory can be used to good effect in kernels that do some kind of modified copying, such as a transpose operation. In these cases, the cost of the trip through shared memory is offset by the improved coalescing performance. But with your naive kernel, both reads and writes should coalesce.
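
For illustration only, here is a sketch of the classic tiled-transpose idea (my code, not the answer's): the tile is read from global memory row-wise and written out column-wise through shared memory, so both the read and the write stay coalesced:

#define TILE_DIM 32

__global__ void transposeTiled(float* __restrict__ out, const float* __restrict__ in,
                               int width, int height)   // 'in' is height rows x width cols
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];      // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap the block coordinates so the write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}

// launched with dim3 threads(32, 32) and a grid of ceil(width/32) x ceil(height/32) blocks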

For a single large copy operation, cudaMemcpyDeviceToDevice should give very good performance, as the overhead of the single call is amortized over the entire data movement. Perhaps you should time the two approaches -- it's easy to do with nvprof. The discussions referenced in the comments refer to a specific use-case where matrix quadrants are being swapped. In that case, an NxN matrix requires ~1.5N cudaMemcpy operations, but is being compared to a single kernel call. In that case, the overhead of the API call setup will start to become a significant factor. However, when comparing a single cudaMemcpy operation to a single equivalent kernel call, the cudaMemcpy operation should be fast.

__constant__ memory cannot be modified by device code, so you will have to use host code based on cudaMemcpyToSymbol and cudaMemcpyFromSymbol.
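
A minimal sketch of that approach, assuming the mechanics are all you want (the chunk size and helper name are mine, and there is no performance reason to stage a copy through __constant__ memory): both calls accept cudaMemcpyDeviceToDevice, so the host loops over chunks no larger than the constant-memory budget:

#define CHUNK_FLOATS 16384   // 16384 * 4 B = 64 KB, the total __constant__ budget
                             // (use less if other __constant__ variables exist)

__constant__ float c_staging[CHUNK_FLOATS];

// Hypothetical helper: copies n floats from d_src to d_dst via the staging buffer.
void copyViaConstant(float* d_dst, const float* d_src, size_t n)
{
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t count = (n - off < CHUNK_FLOATS ? n - off : CHUNK_FLOATS) * sizeof(float);
        cudaMemcpyToSymbol(c_staging, d_src + off, count, 0, cudaMemcpyDeviceToDevice);
        cudaMemcpyFromSymbol(d_dst + off, c_staging, count, 0, cudaMemcpyDeviceToDevice);
    }
}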

– Robert Crovella

Robert Crovella has already answered this question. Here I am just providing sample code to compare two approaches for device-to-device memory copies in CUDA:

  1. using cudaMemcpyDeviceToDevice;
  2. using a copy kernel.

THE CODE

The test code is the following:

#include <stdio.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

#define BLOCKSIZE   512

/***************/
/* COPY KERNEL */
/***************/
__global__ void copyKernel(const double * __restrict__ d_in, double * __restrict__ d_out, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid >= N) return;

    d_out[tid] = d_in[tid];

}

/********/
/* MAIN */
/********/
int main() {

    const int N = 1000000;

    TimingGPU timerGPU;

    double *h_test = (double *)malloc(N * sizeof(double));

    for (int k = 0; k < N; k++) h_test[k] = 1.;

    double *d_in;   gpuErrchk(cudaMalloc(&d_in, N * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_in, h_test, N * sizeof(double), cudaMemcpyHostToDevice));

    double *d_out; gpuErrchk(cudaMalloc(&d_out, N * sizeof(double)));

    timerGPU.StartCounter();
    gpuErrchk(cudaMemcpy(d_out, d_in, N * sizeof(double), cudaMemcpyDeviceToDevice));
    printf("cudaMemcpy timing = %f [ms]\n", timerGPU.GetCounter());

    timerGPU.StartCounter();
    copyKernel<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_in, d_out, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Copy kernel timing = %f [ms]\n", timerGPU.GetCounter());

    return 0;
}

The Utilities.cu and Utilities.cuh files are maintained here, while TimingGPU.cu and TimingGPU.cuh are maintained here.

THE TIMING

Tests performed on a GeForce GTX 960 card. Timings are in ms.

N           cudaMemcpyDeviceToDevice           copy kernel
1000        0.0075                             0.029
10000       0.0078                             0.072
100000      0.019                              0.068
1000000     0.20                               0.22
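
As a rough sanity check on these numbers: the N = 1000000 case moves 2 × 10^6 × 8 bytes ≈ 16 MB (one read plus one write); at 0.20 ms that is about 80 GB/s, plausibly close to the GTX 960's ~112 GB/s theoretical peak bandwidth.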

The results confirm Robert Crovella's conjecture: cudaMemcpyDeviceToDevice is generally preferable to a copy kernel.

– Vitality

For completeness, here is a benchmark comparing a grid-stride copy kernel, with its launch configuration chosen via cudaOccupancyMaxPotentialBlockSize, against cudaMemcpyDeviceToDevice across several array sizes and GPUs:
#include <iostream>
#include <vector>
#include <iomanip>
#include <cuda_runtime.h>

#define CHECK_CUDA(cond) check_cuda(cond, __LINE__)

void check_cuda(cudaError_t status, std::size_t line)
{
    if(status != cudaSuccess)
    {
        std::cout << cudaGetErrorString(status) << '\n';
        std::cout << "Line: " << line << '\n';
        throw 0;
    }
}

__global__ void copy_kernel(float* __restrict__ output, const float* __restrict__ input, int N)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  i < N; i += blockDim.x * gridDim.x) 
        output[i] = input[i];
}

int main()
{
    constexpr int num_trials = 100;
    std::vector<int> test_sizes = { 100'000, 1'000'000, 10'000'000, 100'000'000, 250'000'000 };

    int grid_size = 0, block_size = 0;
    CHECK_CUDA(cudaOccupancyMaxPotentialBlockSize(&grid_size, &block_size, copy_kernel, 0));

    std::cout << std::fixed << std::setprecision(4) << std::endl;

    for (auto sz : test_sizes)
    {
        std::cout << "Test Size: " << sz << '\n';

        float *d_vector_src = nullptr, *d_vector_dest = nullptr;
        CHECK_CUDA(cudaMalloc(&d_vector_src, sz * sizeof(float)));
        CHECK_CUDA(cudaMalloc(&d_vector_dest, sz * sizeof(float)));

        cudaEvent_t start, stop;
        CHECK_CUDA(cudaEventCreate(&start));
        CHECK_CUDA(cudaEventCreate(&stop));

        float accumulate = 0.0;
        for (int i = 0; i < num_trials; i++)
        {
            CHECK_CUDA(cudaEventRecord(start));
            copy_kernel<<<grid_size, block_size>>>(d_vector_dest, d_vector_src, sz);
            CHECK_CUDA(cudaEventRecord(stop));
            CHECK_CUDA(cudaEventSynchronize(stop));

            float current_time = 0;
            CHECK_CUDA(cudaEventElapsedTime(&current_time, start, stop));
            accumulate += current_time;
        }
        std::cout << "\tKernel Copy Time: " << accumulate / num_trials << "ms\n";

        accumulate = 0.0;
        for (int i = 0; i < num_trials; i++)
        {
            CHECK_CUDA(cudaEventRecord(start));
            CHECK_CUDA(cudaMemcpy(d_vector_dest, d_vector_src, sz * sizeof(float), cudaMemcpyDeviceToDevice));
            CHECK_CUDA(cudaEventRecord(stop));
            CHECK_CUDA(cudaEventSynchronize(stop));

            float current_time = 0;
            CHECK_CUDA(cudaEventElapsedTime(&current_time, start, stop));
            accumulate += current_time;
        }
        std::cout << "\tMemcpy Time: " << accumulate / num_trials << "ms\n";

        // Release the timing events and device buffers for this size.
        CHECK_CUDA(cudaEventDestroy(start));
        CHECK_CUDA(cudaEventDestroy(stop));
        CHECK_CUDA(cudaFree(d_vector_src));
        CHECK_CUDA(cudaFree(d_vector_dest));
    }

    return 0;
}
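
For reference, the code uses C++14 digit separators in the test sizes, so it needs at least -std=c++14 to build; something like nvcc -O3 -std=c++14 copy_benchmark.cu (the file name is illustrative) should work.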

GTX 1050 Mobile

Test Size: 100000
        Kernel Copy Time: 0.0118ms
        Memcpy Time: 0.0127ms
Test Size: 1000000
        Kernel Copy Time: 0.0891ms
        Memcpy Time: 0.0899ms
Test Size: 10000000
        Kernel Copy Time: 0.8697ms
        Memcpy Time: 0.8261ms
Test Size: 100000000
        Kernel Copy Time: 8.8871ms
        Memcpy Time: 8.2401ms
Test Size: 250000000
        Kernel Copy Time: 22.3060ms
        Memcpy Time: 20.5419ms

GTX 1080 Ti

Test Size: 100000
    Kernel Copy Time: 0.0166ms
    Memcpy Time: 0.0188ms
Test Size: 1000000
    Kernel Copy Time: 0.0580ms
    Memcpy Time: 0.0727ms
Test Size: 10000000
    Kernel Copy Time: 0.4674ms
    Memcpy Time: 0.5047ms
Test Size: 100000000
    Kernel Copy Time: 4.7992ms
    Memcpy Time: 3.7722ms
Test Size: 250000000
    Kernel Copy Time: 7.2485ms
    Memcpy Time: 5.5863ms
Test Size: 1000000000
    Kernel Copy Time: 31.5570ms
    Memcpy Time: 22.3184ms

RTX 2080 Ti

Test Size: 100000
    Kernel Copy Time: 0.0048ms
    Memcpy Time: 0.0054ms
Test Size: 1000000
    Kernel Copy Time: 0.0193ms
    Memcpy Time: 0.0220ms
Test Size: 10000000
    Kernel Copy Time: 0.1578ms
    Memcpy Time: 0.1537ms
Test Size: 100000000
    Kernel Copy Time: 2.1156ms
    Memcpy Time: 1.5006ms
Test Size: 250000000
    Kernel Copy Time: 5.5195ms
    Memcpy Time: 3.7424ms
Test Size: 1000000000
    Kernel Copy Time: 23.2106ms
    Memcpy Time: 14.9483ms
– Yashas