Here is part of the CUDA SDK (2.3) matrixMultiply kernel:
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int XI  = wA * ty + tx;
    int XII = wB * ty + tx;

    ////////////////////
    // PREFETCH BLOCK //
    ////////////////////

    AS(ty, tx) = A[a + XI];
    BS(ty, tx) = B[b + XII];
    __syncthreads();

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
}
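(For reference, the variables and macros used above are set up earlier in the SDK kernel. I am paraphrasing the sample from memory here, so treat the exact form as approximate:)

    // Thread indices and sub-matrix bookkeeping from the SDK sample
    int tx = threadIdx.x, ty = threadIdx.y;
    int aBegin = wA * BLOCK_SIZE * blockIdx.y;  // first sub-matrix of A for this block
    int aEnd   = aBegin + wA - 1;               // last sub-matrix of A for this block
    int aStep  = BLOCK_SIZE;                    // step through the sub-matrices of A
    int bBegin = BLOCK_SIZE * blockIdx.x;       // first sub-matrix of B for this block
    int bStep  = BLOCK_SIZE * wB;               // step through the sub-matrices of B
    float Csub = 0;                             // accumulator for this thread's output element
    #define AS(i, j) As[i][j]
    #define BS(i, j) Bs[i][j]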
This version of matrix multiply brings a tile into shared memory and performs the calculation at shared-memory bandwidth. I want to improve performance by prefetching the data for the next iteration into the L1 cache. I used the prefetch intrinsic as suggested here and inserted the following commands into the PREFETCH BLOCK above:
long long int k, kk;

k = ((long long int)A + aStep);
if (k <= aEnd) prefetch_l1(k + XI);

kk = ((long long int)B + bStep);
if (kk <= aEnd) prefetch_l1(kk + XII);
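(To clarify the intent: the addresses I want to touch are the elements that the next iteration of the tile loop will load, i.e. conceptually something like the following. This only illustrates the intent; the code I actually timed is the pointer-arithmetic version above.)

    // Intended prefetch targets: the tile elements loaded by the next loop iteration
    if (a + aStep <= aEnd) {
        const float *nextA = &A[a + aStep + XI];   // next A tile element for this thread
        const float *nextB = &B[b + bStep + XII];  // next B tile element for this thread
        // these are the addresses I want resident in L1 before the next iteration
    }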
After testing, the two versions (with and without prefetching) perform very similarly (average of 3 runs):
without prefetching: 6434.866211 (ms)
with prefetching: 6480.041016 (ms)
Question:
I expected to see some speedup from the prefetching, but I'm confused by the results. Does anybody have an explanation for why these two implementations perform so similarly? Maybe I am doing the prefetching incorrectly.
Thank you in advance.
Further information:
GPU: Tesla C2050
CUDA version: 4.0
inline __device__ void prefetch_l1 (unsigned int addr)
{
    asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
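Since the intrinsic takes an unsigned int, I also wonder whether the address could get truncated on a 64-bit build. A pointer-sized variant would presumably look like the sketch below (my own guess, not from the answer I followed; it assumes a 64-bit target so the "l" constraint passes the full address, and I have not verified that it changes the timings):

    inline __device__ void prefetch_l1_ptr (const void *ptr)
    {
        // Sketch, assuming a 64-bit build: pass the full pointer via the "l" (64-bit) constraint.
        asm volatile(" prefetch.global.L1 [ %0 ];" :: "l"(ptr));
    }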