Suppose I have a kernel which performs strided memory access as follows:
__global__ void strideExample(float *outputData, float *inputData, int stride = 2)
{
    // Each thread touches one element, spaced `stride` elements apart.
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}
I understand that a stride of 2 results in 50% load/store efficiency, since half of the elements fetched in each memory transaction are never used (wasted bandwidth). How do I calculate the load/store efficiency for larger strides? Thanks in advance!