I am running some GPU benchmarks to understand how to maximize the memory bandwidth to/from global memory. I have a 128 MB array (32*1024*1024 single-precision floating-point numbers) aligned to a 128-byte boundary, with three halo values before and after the actual data. So the first element of the actual data is aligned to a 128-byte boundary.
In the following, n refers to the number of elements in my array (excluding halo): n = 32*1024*1024. m refers to the number of 128-byte words in the array: m = 1024*1024 = 1048576.
*array // Aligned to a 128-byte boundary
*(array-3) // Start of the (unaligned) halo region
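A minimal allocation sketch of how I set this up, assuming cudaMalloc returns (at least) 128-byte-aligned pointers and padding the front by a full 128-byte word so the halo stays inside the allocation (base is just a helper name for illustration):
const size_t n = 32u * 1024u * 1024u;             // payload elements
const size_t pad = 32;                            // 32 floats = 128 bytes of padding
float *base = NULL;                               // helper name, not in my actual code
cudaMalloc((void**)&base, (n + 2 * pad) * sizeof(float));
float *array = base + pad;                        // 128-byte aligned; halo lives at array-3 .. array-1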
I also have a similar output array, which is aligned to the same boundary and does not contain a halo.
I have several kernels which all implement a similar computation with different access patterns:
P1: *(output+i) = *(array+i) // for i in 0..n
P2: *(output+i) = *(array+i) + *(array+i+1)
P3: *(output+i) = *(array+i-1) + *(array+i+1)
All these computations are clearly bandwidth-bound. I'm trying to optimize the global memory transactions. The code I'm using is pretty simple:
__global__ void P1(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i);
}
__global__ void P2(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i) + *(input+i+1);
}
__global__ void P3(const float* input, float* output)
{
const int i = threadIdx.x + blockDim.x*blockIdx.x;
*(output+i) = *(input+i-1) + *(input+i+1);
}
I have 1024 threads per block and the correct number of blocks such that every thread is assigned exactly one value of the output array.
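A minimal launch sketch under this configuration (d_input and d_output are placeholder device pointers, not part of the code above):
const int threadsPerBlock = 1024;
const int blocks = (32*1024*1024) / threadsPerBlock;  // = 32768, one thread per output element
P1<<<blocks, threadsPerBlock>>>(d_input, d_output);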
I compiled with both the caching and non-caching options (-Xptxas -dlcm={ca,cg}) and benchmarked with nvprof, extracting the following metrics:
- ldst_issued: Issued load/store instructions
- ldst_executed: Executed load/store instructions
- gld_transactions: Global load transactions
- gst_transactions: Global store transactions
- dram_read_throughput: Device memory read throughput
- dram_write_throughput: Device memory write throughput
The GPU I'm using is an Nvidia K20X.
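For reference, the compile and profiling invocations look roughly like this (file and executable names are placeholders):
nvcc -O3 -arch=sm_35 -Xptxas -dlcm=ca kernels.cu -o bench_ca    # caching loads
nvcc -O3 -arch=sm_35 -Xptxas -dlcm=cg kernels.cu -o bench_cg    # non-caching loads
nvprof --metrics ldst_issued,ldst_executed,gld_transactions,gst_transactions,dram_read_throughput,dram_write_throughput ./bench_ca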
I expect ldst_executed to be exactly (k+1) * m, where k is 1 for P1, 2 for P2 and 3 for P3 and represents the number of values read by every thread. I also expect gst_transactions to be m (coalesced access: writes in 128-byte words), and gld_transactions to be m for P1, somewhere between m and 2m for P2 and somewhere between m and 3m for P3, since every warp has to access its "assigned" portion of memory just like in P1, plus the following 128 bytes for P2, plus the previous 128 bytes for P3, but I'm not sure that the warp is the right unit here. I'm expecting some threads to be able to avoid the global memory access because the data have already been fetched into the L1 cache by a previous thread.
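Concretely, with m = 1048576 this model predicts ldst_executed = 2m = 2097152 for P1, 3m = 3145728 for P2 and 4m = 4194304 for P3.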
These are the results:
P1:
gld_transactions 1048576
gst_transactions 1048576
ldst_issued 2097152
ldst_executed 2097152
dram_read_throughput 92.552 GB/s
dram_write_throughput 93.067 GB/s
P2:
gld_transactions 3145728
gst_transactions 1048576
ldst_issued 5242880
ldst_executed 3145728
dram_read_throughput 80.748 GB/s
dram_write_throughput 79.878 GB/s
P3:
gld_transactions 5242880
gst_transactions 1048576
ldst_issued 8052318
ldst_executed 4194304
dram_read_throughput 79.693 GB/s
dram_write_throughput 78.510 GB/s
I already see some discrepancies:
- The number of load transactions is substantially increased from P1 to P2 and P3.
- The number of issued load/store instructions is also very high in P2 and P3, beyond what I could explain. I'm not sure I understand what this number represents.
When I turn to the non-caching tests, these are the results:
P1:
gld_transactions 1048576
gst_transactions 1048576
ldst_issued 2097152
ldst_executed 2097152
dram_read_throughput 92.577 GB/s
dram_write_throughput 93.079 GB/s
P2:
gld_transactions 3145728
gst_transactions 1048576
ldst_issued 5242880
ldst_executed 3145728
dram_read_throughput 80.857 GB/s
dram_write_throughput 79.959 GB/s
P3:
gld_transactions 5242880
gst_transactions 1048576
ldst_issued 8053556
ldst_executed 4194304
dram_read_throughput 79.661 GB/s
dram_write_throughput 78.484 GB/s
As you can see, nothing changes. I was expecting to see some differences, since in the non-caching case the L1 cache is bypassed and the transactions happen in 32-byte words.
Questions:
- Is my approach sound at all?
- Can shared memory help me reduce the number of transfers?
- Why don't I see a substantial difference between the caching and the non-caching cases?
- Why is P3 not slower than P2 in the same way P2 is slower than P1?
- Which other metrics would help me understanding what is actually happening?