
I followed the CUDA Introduction here (https://devblogs.nvidia.com/even-easier-introduction-cuda/).

I wrote the same program as the author. However, the result on my server with a GTX 1080 Ti is even slower than the author's GT 750M.

How could it be?

The code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i+=stride)
        y[i] = x[i] + y[i];
}

int main(int argc, char *argv[])
{
    int N = 1<<20;
    float *x, *y;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    int block_size = 256;
    int num_blocks = (N + block_size - 1) / block_size;

    // Run kernel on 1M elements on the GPU
    add<<<num_blocks, block_size>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}

The result on my server is 4.1499 ms with the 1080 Ti, while the author gets 0.68 ms with the 750M.

I measured the time with the nvprof command, the same way the author did.

I compiled the program with nvcc with default settings, as the author did.
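For reference, a compile-and-profile session with default settings would look roughly like this (a sketch; the file name `add.cu` is an assumption, and the exact nvprof output format varies by CUDA version):

```shell
# Compile with nvcc's default settings, then time the kernel with nvprof.
nvcc add.cu -o add
nvprof ./add
```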

lucky yang
    How are you compiling the program? Same compiler and optimization options as the author? – Jesper Juhl Apr 27 '19 at 09:48
  • In addition to what Jesper said, how are you actually measuring the time? – Michael Kenzel Apr 27 '19 at 11:11
  • I've updated the question description – lucky yang Apr 27 '19 at 12:51
  • which CUDA version and operating system are you using? The author's 750M is definitely in the pre-Pascal regime for UM, which means that the data migration happens efficiently, en masse, at kernel launch time, and the data migration does not impact the kernel timing. If you are on linux with a recent version of CUDA, you are in the post-Pascal regime for UM, which means that the data migration is handled via demand-paging, which is slower and less efficient, and the data migration is impacting your kernel time measurement. – Robert Crovella Apr 27 '19 at 13:10
  • See [here](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/) and [here](https://stackoverflow.com/questions/43768717/speed-of-pascal-cuda8-1080ti-unified-memory/43809667#43809667) and [here](https://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana/40011988#40011988) I suspect your question is a duplicate of this last one. – Robert Crovella Apr 27 '19 at 13:10
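Following Robert Crovella's diagnosis, one commonly suggested remedy on post-Pascal GPUs (a sketch based on the linked answers, not part of the original question) is to prefetch the managed arrays to the device before launching the kernel, so the page migrations happen up front instead of being counted against the kernel time:

```cuda
// Sketch: prefetch managed memory to the GPU before the kernel launch,
// so demand-paging migrations are excluded from the measured kernel time.
int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0); // 0 = default stream
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);

add<<<num_blocks, block_size>>>(N, x, y);
cudaDeviceSynchronize();
```

With the migrations moved out of the timed region, the kernel time reported by nvprof should reflect the actual compute work, which is where the 1080 Ti would be expected to beat the 750M.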

0 Answers