Improving CUDA SDK matrixMul by the prefetching intrinsic

Question

Here is the part of CUDA SDK (2.3) matrixMultiply kernel:

for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int XI=wA * ty + tx;
    int XII=wB * ty + tx;
    ////////////////////
    // PREFETCH BLOCK //
    ////////////////////
    AS(ty, tx) = A[a + XI];
    BS(ty, tx) = B[b + XII];

    __syncthreads();

    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    __syncthreads();
}

This version of matrix multiply brings a tile into shared memory and performs the calculation at the shared memory bandwidth. I want to improve the performance by prefetching the data of next iteration into L1 cache. I use the prefetch intrinsic as suggested here and inserted the following commands into the PREFETCH BLOCK above:

    long long int k,kk;
    k=((long long int)A+aStep); if(k<=aEnd) prefetch_l1(k+XI);
    kk=((long long int)B+bStep); if(kk<=aEnd) prefetch_l1(kk+XII);

After test, two versions (with or without prefetching) perform very similar (average of 3 runs):

without prefetching: 6434.866211 (ms)

with prefetching: 6480.041016 (ms)

Question:

I expect to see some speedup out of the prefetching but I'm confused with the results. Any body has any justification why these two implementations perform very close? Maybe I am performing a wrong prefetching.

Thank you in advance.

Further informations:

GPU: Tesla C2050

CUDA version: 4.0

inline __device__ void prefetch_l1 (unsigned int addr)
{

  asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}

score 1 · Answer 1 · answered Dec 02 '12 at 22:05

1

Prefetching (on any architecture) is only of benefit if:

you have memory bandwidth to spare and
you can initiate the prefetch at the correct time, i.e. sufficiently far ahead of time before the data is actually needed and at a time when spare memory bandwidth is available.

If you can't meet these criteria then prefetching is not going to help, and may even do more harm than good.

answered Dec 02 '12 at 22:05

Paul R

208,748
37
389
560

The data should arrive timely ahead of next iteration requests, if the prefetch requests of my code have the same priority as load/store in Tesla C2050. I do not know how Tesla works! If the priorities are fair, according to your indications, the main criteria here should be the memory bandwidth. – lashgar Dec 03 '12 at 06:03
2

Have you confirmed that your algorithm is memory bound? I think prefetching would only be useful on memory bound algorithms. On compute bound algorithms, the GPU hides memory latency by putting warps on hold while their memory requests are being serviced. So, if there are always warps ready to run, prefetching won't help. Also note that memory latency is very long compared to instruction throughput. Around 300 clocks on Fermi. You would have to do a lot of work between prefetching and using a value to hide that. – Roger Dahl Dec 03 '12 at 14:44
@RogerDahl Thank you. MatrixMultiply is memory-bound and I guess the prefetching should improve the performance. In your last sentence, do you mean the prefetching and the real accesses are not distanced enough to make useful prefetching? I've tested the prefetching for 2, 3, and 4 iterations ahead but the result is the same. – lashgar Dec 05 '12 at 12:38
1

@ahmad: Yes, that's what I meant, but now I think there's some flawed logic in my comment, and I'm really wondering when prefetching might help. I said two things that I now think are in conflict. 1) "I think prefetching would only be useful on memory bound algorithms." and 2) "if there are always warps ready to run, prefetching won't help." In (2), having warps always ready to run means that the algorithm is memory bound. Thus, it conflicts with (1). So Paul R's answer that says that prefetching is only of benefit if you have memory bandwidth to spare makes the most sense. – Roger Dahl Dec 05 '12 at 13:54
1

@ahmad, (continued) After all, if you are already using up all available memory bandwidth, it doesn't make much difference where, in your algorithm, you are initiating the transfers. So, the bottom line seems to be that prefetching cannot help you, because you have a memory bound algorithm. A slight problem is that I can't think of how prefetching can help in compute bound algorithms either! That's because bringing in data faster won't help if all available compute resources are always busy. – Roger Dahl Dec 05 '12 at 14:02
1

@ahmad: On the CPU, an algorithm can be memory bound even when the memory bandwidth is not saturated. That happens because the CPU can't switch to some other work and possibly initiate new memory transactions while waiting for the first values to arrive. It seems to me that, that is the only situation where prefetching will help, and, by its very nature, that situation does not occur on the GPU. – Roger Dahl Dec 05 '12 at 14:12
@RogerDahl I think we can say the prefetching works when the 1) application is sensitive to memory ***latency*** and 2) the memory bandwidth is not fully utilized. In my code, I guess the performance of "shared-memory matrix-multiply" is bounded by shared memory access latency not global memory. I mean, the baseline code hide the global memory latency effectively while my code tries to hide the global memory latency further (which is not bottleneck here). Does it make sense? – lashgar Dec 05 '12 at 14:13
@RogerDahl (continue) Here in GPU, prefetching works when the multithreading is not enough to ***hide*** the memory latency. – lashgar Dec 05 '12 at 14:14

Improving CUDA SDK matrixMul by the prefetching intrinsic

1 Answers1