0
#include <vector_functions.h>
#include <vector_types.h>

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <string>

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// Two 16-element device-global arrays at statically known addresses.
// `foo` is read by the `go` kernel (through its `ptr` parameter in main)
// and `bar` is written by it; the benchmark below hinges on whether the
// compiler can prove these two symbols do not alias.
__device__ int foo[16];
__device__ int bar[16];

// Benchmark kernel: each thread runs 2^20 iterations; every iteration
// stores the running value into bar[blockIdx.x] and then folds
// ptr[threadIdx.x] into `val` via a modular hash.
// In this file `ptr` is set (in main) to the device address of `foo`, so the
// load is loop-invariant — but because `ptr` is not __restrict__-qualified,
// the compiler must assume it may alias `bar`, which prevents hoisting the
// load and sinking the store out of the loop (the point of the experiment).
// NOTE(review): all 16 threads of a block store to the same bar[blockIdx.x];
// the final value is whichever thread's store lands last — intentional here,
// as only timing is observed.
__global__ void go(const int* ptr) {
  printf("device: tid = %d, foo = %p\n", blockIdx.x, foo);
  printf("device: tid = %d, ptr = %p\n", blockIdx.x, ptr);

  int val = threadIdx.x;
  for (int i = 0; i < (1 << 20); i++) {
    bar[blockIdx.x] = val;
    val = (val * 19 + ptr[threadIdx.x]) % (int)(1e9 + 7); // change ptr to foo for experiment
  }
}

// Aborts with a diagnostic on any CUDA runtime error. Without this, a failed
// launch or API call goes unnoticed and the benchmark prints a bogus timing.
#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__,       \
              cudaGetErrorString(err_));                                  \
      std::abort();                                                       \
    }                                                                     \
  } while (0)

// Times the `go` kernel when it receives the address of `foo` through a
// plain (non-restrict) pointer parameter, forcing the compiler to assume
// the pointer may alias `bar`. Prints the elapsed time in milliseconds.
int main() {
  int* ptr = nullptr;
  CUDA_CHECK(cudaGetSymbolAddress((void**)&ptr, foo));

  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));

  CUDA_CHECK(cudaEventRecord(start));
  go<<<16, 16>>>(ptr);
  CUDA_CHECK(cudaGetLastError());       // catch launch-configuration errors
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  CUDA_CHECK(cudaDeviceSynchronize());  // surface any in-kernel fault

  float ms = 0.0f;  // initialized so a failed query can't print stack garbage
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  printf("%.6fms\n", ms);

  // The original leaked both events; release them before exiting.
  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));

  return 0;
}

On my GeForce GTX 1080, using `ptr` takes 180 ms but using `foo` takes only 36 ms, even though `ptr` and `foo` point to the exact same address. I expected the two versions to perform at the same speed, because both are global-memory accesses cached by L2.

I am using Linux and my compilation command is:

nvcc -gencode=arch=compute_61,code=compute_61 -Xptxas -O3 test.cu -o test

Can anybody explain why?

Sebastian
  • 1,834
  • 2
  • 10
  • 22
tigertang
  • 445
  • 1
  • 6
  • 18

1 Answer

4

The reason for the difference in the two cases, is that when foo is used explicitly, the compiler (ptxas, in this case) knows that foo does not alias bar, and so can make a specific optimization. When the kernel argument ptr is used instead, the compiler does not know whether this aliasing is occurring, and assumes it might be. This has significant ramifications for device code generation.

As a proof point, recompile your test case with the following kernel prototype:

__global__ void go(const int*  __restrict__ ptr) {

and you will see that the time difference disappears. This is informing the compiler that ptr cannot alias any other known location (such as bar) and so this allows similar code generation in both cases. (In the real world, you would/should only use such decoration when you are prepared to make that kind of contract with the compiler.)

Details:

It's important to remember that the device code compiler is an optimizing compiler. Furthermore, the device code compiler is interested primarily in correctness from a single-thread point of view. Multithreaded access to the same location is not in view of this answer, and indeed is not considered by the device code compiler. It is the programmer's responsibility to ensure correctness when multiple threads are accessing the same location.

With that preamble, the primary difference here appears to be one of optimization. With knowledge that foo (or ptr) does not alias bar and considering only a single thread of execution, it is fairly evident that your kernel loop code could be rewritten as:

int val = threadIdx.x;
int ptrval = ptr[threadIdx.x];  // becomes a LDG instruction
for (int i = 0; i < ((1 << 20)-1); i++) {
 val = (val * 19 + ptrval) % (int)(1e9 + 7); 
} 
bar[blockIdx.x] = val;          // becomes a STG instruction

A major impact of this optimization is that we go from writing bar many times to just once. With this optimization, the reads of ptr can also be "optimized into a register" (since we now know it is loop-invariant). The net effect being that all global loads and stores in the loop are eliminated. On the other hand, if ptr may or may not alias bar, then we must allow for the possibility, and the above optimization would not hold.

This appears to be roughly what the compiler is doing. In the case where we use foo (or __restrict__), the compiler has arranged (in the sass code) a single global load at the beginning, a single global store at the end, and a partially unrolled loop full of integer arithmetic.

However, when we leave the code as-is/as-posted, the compiler has also partially unrolled the loop, but has sprinkled LDG and STG instructions throughout the partially unrolled loop.

You can observe this yourself using the cuda binary utilities, for example:

cuobjdump -sass test

(for each case)

The device code printf statements don't materially change any of the observations here, so for simplicity of analysis I would just remove those.

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257