I am running some GPU benchmarks to understand how to maximize the memory bandwidth to/from global memory. I have a 128 MB array (32*1024*1024 single-precision floating-point numbers) aligned to a 128-byte boundary, with three halo values before and after the actual data. So the first element of the actual data is aligned to a 128-byte boundary.
In the following, n refers to the number of elements in my array (excluding halo): n = 32*1024*1024. m refers to the number of 128-byte words in the array: m = 1024*1024 = 1048576.
*array // Aligned to a 128-byte boundary
*(array-3) // Start of the (unaligned) halo region
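A minimal allocation sketch of how I set this up, assuming cudaMalloc returns (at least) 128-byte-aligned pointers and padding the front by a full 128-byte word so the halo stays inside the allocation (base is just a helper name for illustration):
const size_t n = 32u * 1024u * 1024u;             // payload elements
const size_t pad = 32;                            // 32 floats = 128 bytes of padding
float *base = NULL;                               // helper name, not in my actual code
cudaMalloc((void**)&base, (n + 2 * pad) * sizeof(float));
float *array = base + pad;                        // 128-byte aligned; halo lives at array-3 .. array-1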
I also have a similar output array, which is aligned to the same boundary and does not contain a halo.
I have several kernels which all implement a similar computation with different access patterns:
P1: *(output+i) = *(array+i) // for i in 0..n
P2: *(output+i) = *(array+i) + *(array+i+1)
P3: *(output+i) = *(array+i-1) + *(array+i+1)
All these computations are clearly bandwidth-bound. I'm trying to optimize the global memory transactions. The code I'm using is pretty simple:
__global__ void P1(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i);
}
__global__ void P2(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i) + *(input+i+1);
}
__global__ void P3(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i-1) + *(input+i+1);
}
I have 1024 threads per block and the correct number of blocks such that every thread is assigned exactly one value of the output array.
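A minimal launch sketch under this configuration (d_input and d_output are placeholder device pointers, not part of the code above):
const int threadsPerBlock = 1024;
const int blocks = (32*1024*1024) / threadsPerBlock;  // = 32768, one thread per output element
P1<<<blocks, threadsPerBlock>>>(d_input, d_output);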
I compiled with both the caching and non-caching options (-Xptxas -dlcm={ca,cg}) and benchmarked with nvprof, extracting the following metrics:
- ldst_issued: Issued load/store instructions
- ldst_executed: Executed load/store instructions
- gld_transactions: Global load transactions
- gst_transactions: Global store transactions
- dram_read_throughput: Device memory read throughput
- dram_write_throughput: Device memory write throughput
The GPU I'm using is an Nvidia K20X.
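For reference, the compile and profiling invocations look roughly like this (file and executable names are placeholders):
nvcc -O3 -arch=sm_35 -Xptxas -dlcm=ca kernels.cu -o bench_ca    # caching loads
nvcc -O3 -arch=sm_35 -Xptxas -dlcm=cg kernels.cu -o bench_cg    # non-caching loads
nvprof --metrics ldst_issued,ldst_executed,gld_transactions,gst_transactions,dram_read_throughput,dram_write_throughput ./bench_ca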
I expect ldst_executed to be exactly (k+1) * m, where k is 1 for P1, 2 for P2 and 3 for P3 and represents the number of values read by every thread. I also expect gst_transactions to be m (coalesced access: writes in 128-byte words), and gld_transactions to be m for P1, somewhere between m and 2m for P2 and somewhere between m and 3m for P3, since every warp has to access its "assigned" portion of memory just like in P1, plus the following 128 bytes for P2, plus the previous 128 bytes for P3, but I'm not sure that the warp is the right unit here. I'm expecting some threads to be able to avoid the global memory access because the data have already been fetched into the L1 cache by a previous thread.
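Concretely, with m = 1048576 this model predicts ldst_executed = 2m = 2097152 for P1, 3m = 3145728 for P2 and 4m = 4194304 for P3.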
These are the results:
P1:
gld_transactions 1048576
gst_transactions 1048576
ldst_issued 2097152
ldst_executed 2097152
dram_read_throughput 92.552 GB/s
dram_write_throughput 93.067 GB/s
P2:
gld_transactions 3145728
gst_transactions 1048576
ldst_issued 5242880
ldst_executed 3145728
dram_read_throughput 80.748 GB/s
dram_write_throughput 79.878 GB/s
P3:
gld_transactions 5242880
gst_transactions 1048576
ldst_issued 8052318
ldst_executed 4194304
dram_read_throughput 79.693 GB/s
dram_write_throughput 78.510 GB/s
I already see some discrepancies:
- The number of load transactions is substantially increased from P1 to P2 and P3.
- The number of issued load/store instructions is also very high in P2 and P3, beyond what I could explain. I'm not sure I understand what this number represents.
When I turn to the non-caching tests, these are the results:
P1:
gld_transactions 1048576
gst_transactions 1048576
ldst_issued 2097152
ldst_executed 2097152
dram_read_throughput 92.577 GB/s
dram_write_throughput 93.079 GB/s
P2:
gld_transactions 3145728
gst_transactions 1048576
ldst_issued 5242880
ldst_executed 3145728
dram_read_throughput 80.857 GB/s
dram_write_throughput 79.959 GB/s
P3:
gld_transactions 5242880
gst_transactions 1048576
ldst_issued 8053556
ldst_executed 4194304
dram_read_throughput 79.661 GB/s
dram_write_throughput 78.484 GB/s
As you can see, nothing changes. I was expecting to see some differences, since in the non-caching case the L1 cache is bypassed and the transactions happen in 32-byte words.
Questions:
- Is my approach sound at all?
- Can shared memory help me reduce the number of transfers?
- Why don't I see a substantial difference between the caching and the non-caching cases?
- Why is P3 not slower than P2 in the same way P2 is slower than P1?
- Which other metrics would help me understanding what is actually happening?