I present here some code
__constant__ int array[1024];
__global__ void kernel1(int *d_dst) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = array[tId];
}
__global__ void kernel2(int *d_dst, int *d_src) {
int tId = threadIdx.x + blockIdx.x * blockDim.x;
d_dst[tId] = d_src[tId];
}
int main(int argc, char **argv) {
int *d_array;
int *d_src;
cudaMalloc((void**)&d_array, sizeof(int) * 1024);
cudaMalloc((void**)&d_src, sizeof(int) * 1024);
int *test = new int[1024];
memset(test, 0, sizeof(int) * 1024);
for (int i = 0; i < 1024; i++) {
test[i] = 100;
}
cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
kernel1<<< 1, 1024 >>>(d_array);
cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
kernel2<<<1, 32 >>>(d_array, d_src),
free(test);
cudaFree(d_array);
cudaFree(d_src);
return 0;
}
Which simply shows constant memory and global memory usage. On its execution the "kernel2" executes about 4 times faster (in terms of time) than "kernel1"
I understand from the Cuda C programming guide, that this this because accesses to constant memory are getting serialized. Which brings me to the idea that constant memory can be best utilized if a warp accesses a single constant value such as integer, float, double etc. but accessing an array is not beneficial at all. In other terms, I can say a warp must access a single address in order to have any beneficial optimization/speedup gains from constant memory access. Is this correct?
I also want to know, if I keep a structure instead of a simple type in my constant memory. Any access to the structure by a thread with in a warp; is also considered as single memory access or more? I mean a structure might contain multiple simple types and array for example; when accessing these simple types, are these accesses also serialized or not?
Last question would be, in case I do have an array with constant values, which needs to be accessed via different threads within a warp; for faster access it should be kept in global memory instead of constant memory. Is that correct?
Anyone can refer me some example code where an efficient constant memory usage is shown.
regards,