The reason for the difference in the two cases is that when foo is used explicitly, the compiler (ptxas, in this case) knows that foo does not alias bar, and so it can make a specific optimization. When the kernel argument ptr is used instead, the compiler does not know whether this aliasing is occurring, and must assume it might be. This has significant ramifications for device code generation.
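For context, the setup under discussion is approximately the following. This is a hedged reconstruction for illustration only, not the exact posted code: foo, bar, go, and ptr are the names already in play, while the array sizes, the __device__ qualifiers, the go_direct name, and the exact loop body are assumptions on my part:
__device__ int foo[1024];   // placeholder sizes, assumed for this sketch
__device__ int bar[1024];

// Case 1: foo is named directly. The compiler can see that foo and bar are
// distinct objects, so the load from foo and the store to bar cannot overlap.
__global__ void go_direct() {
    bar[blockIdx.x] = threadIdx.x;
    for (int i = 0; i < ((1 << 20)-1); i++)
        bar[blockIdx.x] = (bar[blockIdx.x] * 19 + foo[threadIdx.x]) % (int)(1e9 + 7);
}

// Case 2: the same data arrives through a kernel argument. The compiler cannot
// prove that ptr does not point into bar, so it must assume that it might.
__global__ void go(const int* ptr) {
    bar[blockIdx.x] = threadIdx.x;
    for (int i = 0; i < ((1 << 20)-1); i++)
        bar[blockIdx.x] = (bar[blockIdx.x] * 19 + ptr[threadIdx.x]) % (int)(1e9 + 7);
}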
As a proof point, recompile your test case with the following kernel prototype:
__global__ void go(const int* __restrict__ ptr) {
and you will see that the time difference disappears. This informs the compiler that ptr cannot alias any other known location (such as bar), which allows similar code generation in both cases. (In the real world, you would/should only use such decoration when you are prepared to make that kind of contract with the compiler.)
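To make the contract concrete, here is a hedged host-side sketch. It assumes the __device__ arrays from the sketch above, and the launch configuration is a placeholder. __restrict__ is purely a promise from you to the compiler; it only holds as long as the pointer you pass does not refer to memory the kernel also accesses under another name, such as bar:
int *d_foo = nullptr;
cudaGetSymbolAddress((void **)&d_foo, foo);   // address of the __device__ array foo

go<<<64, 256>>>(d_foo);      // OK: d_foo does not overlap bar, so the __restrict__ promise holds

int *d_bar = nullptr;
cudaGetSymbolAddress((void **)&d_bar, bar);
// go<<<64, 256>>>(d_bar);   // would break the promise: ptr would alias bar, and the
//                           // compiler-generated code is then allowed to produce wrong results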
Details:
It's important to remember that the device code compiler is an optimizing compiler. Furthermore, the device code compiler is primarily interested in correctness from a single-thread point of view. Multithreaded access to the same location is not in view in this answer, and indeed is not considered by the device code compiler. It is the programmer's responsibility to ensure correctness when multiple threads are accessing the same location.
With that preamble, the primary difference here appears to be one of optimization. With the knowledge that foo (or ptr) does not alias bar, and considering only a single thread of execution, it is fairly evident that your kernel loop code could be rewritten as:
int val = threadIdx.x;
int ptrval = ptr[threadIdx.x]; // becomes a LDG instruction
for (int i = 0; i < ((1 << 20)-1); i++) {
    val = (val * 19 + ptrval) % (int)(1e9 + 7);
}
bar[blockIdx.x] = val; // becomes a STG instruction
A major impact of this optimization is that we go from writing bar many times to writing it just once. With this optimization, the reads of ptr can also be "optimized into a register" (since we now know the value is loop-invariant). The net effect is that all global loads and stores in the loop are eliminated. On the other hand, if ptr may or may not alias bar, then we must allow for that possibility, and the above optimization would not hold.
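For contrast, here is a hedged sketch of the shape the compiler must preserve when it cannot rule the aliasing out (again a reconstruction for illustration, not the exact posted kernel):
// If ptr might point into bar, every store to bar[blockIdx.x] could change the
// value observed through ptr[threadIdx.x], so neither the load nor the store
// can be moved out of the loop.
bar[blockIdx.x] = threadIdx.x;                                      // STG
for (int i = 0; i < ((1 << 20)-1); i++) {
    int v = ptr[threadIdx.x];                                       // LDG every iteration
    bar[blockIdx.x] = (bar[blockIdx.x] * 19 + v) % (int)(1e9 + 7);  // LDG + STG every iteration
}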
This appears to be roughly what the compiler is doing. In the case where we use foo (or __restrict__), the compiler has arranged (in the SASS code) a single global load at the beginning, a single global store at the end, and a partially unrolled loop full of integer arithmetic. However, when we leave the code as-is/as-posted, the compiler has also partially unrolled the loop, but has sprinkled LDG and STG instructions throughout it.
You can observe this yourself using the CUDA binary utilities, for example:
cuobjdump -sass test
(for each case)
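Concretely, something along these lines (the file names and the -arch value are placeholders; only cuobjdump -sass itself is the essential part):
nvcc -arch=sm_70 -o test test.cu
cuobjdump -sass test | grep -Ec "LDG|STG"   # count the global load/store instructions in the SASS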
The device code printf statements don't materially change any of the observations here, so for simplicity of analysis I would just remove them.