For perfectly coalesced accesses to an array of 4096 doubles, each 8 bytes, nvprof reports the following metrics on an NVIDIA Tesla V100:
global_load_requests: 128
gld_transactions: 1024
gld_transactions_per_request: 8.000000
I cannot find a specific definition of what a transaction and a request to global memory are exactly, so I am having trouble understanding these metrics. Therefore my questions:
- How is a memory request defined?
- How is a memory transaction defined?
- Does gld_transactions_per_request = 8.000000 indicate perfectly coalesced accesses to doubles?
In an attempt to explain it to myself, this is what I have come up with:
- Request: a load at the warp level, i.e. one warp-level instruction merged from 32 threads. In this scenario, that is a 32 threads * 8 bytes = 256 byte load. Is this correct?
- Transaction: a 32 byte load instruction. In this scenario, one transaction defined this way can load 32 bytes / 8 bytes = 4 doubles. Is this correct? If so, is this the largest load instruction CUDA implements?
Using these definitions, I arrive at the same values as nvprof: accessing 4096 array items requires 4096 / 32 = 128 warp-level instructions (= requests) of 32 threads each. With 32 byte loads (= transactions), each 256 byte request splits into 256 / 32 = 8 transactions, which gives 128 * 8 = 1024 transactions in total.
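To make my reasoning easy to check, here is a minimal host-side C++ sketch of the arithmetic above. It only encodes my assumed model (one request per warp-level load, one transaction per 32 bytes). The names count_requests, count_transactions, kWarpSize and kTransactionBytes are my own and do not come from nvprof or the CUDA documentation.

```cpp
#include <cstddef>

// Assumed model, not confirmed by any documentation I could find:
// one request per warp-level load instruction, one transaction per 32 bytes.
constexpr std::size_t kWarpSize = 32;          // threads per warp
constexpr std::size_t kTransactionBytes = 32;  // assumed transaction size in bytes

// Warp-level load instructions needed when each thread loads one element.
constexpr std::size_t count_requests(std::size_t num_elements) {
    return (num_elements + kWarpSize - 1) / kWarpSize;  // round up for a partial warp
}

// 32-byte transactions needed for contiguous data that starts on a 32-byte boundary.
constexpr std::size_t count_transactions(std::size_t num_elements,
                                         std::size_t element_bytes) {
    return (num_elements * element_bytes + kTransactionBytes - 1) / kTransactionBytes;
}

// Compile-time checks against the values nvprof reported for 4096 doubles.
static_assert(count_requests(4096) == 128, "expected 128 requests");
static_assert(count_transactions(4096, 8) == 1024, "expected 1024 transactions");
static_assert(count_transactions(4096, 8) / count_requests(4096) == 8,
              "expected 8 transactions per request");
```

If the model is right, the same functions would predict 4 transactions per request for 4 byte floats, which would be one way to test it against nvprof.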