For perfectly coalesced accesses to an array of 4096 doubles (8 bytes each), nvprof reports the following metrics on an NVIDIA Tesla V100:

global_load_requests: 128
gld_transactions: 1024
gld_transactions_per_request: 8.000000

I cannot find a specific definition of what a transaction and a request to global memory are exactly, so I am having trouble understanding these metrics. Therefore my questions:

  1. How is a memory request defined?
  2. How is a memory transaction defined?
  3. Does gld_transactions_per_request = 8.00000 indicate perfectly coalesced accesses to doubles?

In an attempt to explain it to myself, this is what I have come up with:

  • Request: a load at the warp level, i.e. one warp-level instruction merged from 32 threads. In this scenario that is a 32 threads * 8 bytes = 256 byte load. -- Is this correct?
  • Transaction: a 32 byte load instruction. In this scenario one transaction defined in this way is able to load 32 bytes / 8 bytes = 4 doubles. -- Is this correct? If so, is this the largest load instruction CUDA implements?

Using these definitions, I arrive at the same values as nvprof: accessing 4096 array items with 32 threads per warp requires 4096 / 32 = 128 warp-level instructions (= requests). Each 256 byte request splits into 256 / 32 = 8 loads of 32 bytes (= transactions), which gives 128 * 8 = 1024 transactions.
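As a sanity check on this arithmetic, here is a minimal sketch in Python. It is only a simplified model and not how nvprof actually counts: the warp size of 32, the 32 byte transaction size, a base address aligned to 32 bytes, and the helper name `count_loads` are all assumptions taken from the reasoning above.

```python
WARP_SIZE = 32      # threads per warp (assumed)
SECTOR_BYTES = 32   # assumed transaction size, as in the reasoning above


def count_loads(num_elements, element_bytes, stride=1):
    """Return (requests, transactions) for one load per thread.

    Simplified model: the array starts at byte offset 0 (32 byte aligned),
    and thread i loads the element at index i * stride.
    """
    requests = 0
    transactions = 0
    for warp_start in range(0, num_elements, WARP_SIZE):
        lanes = range(warp_start, min(warp_start + WARP_SIZE, num_elements))
        # One warp-level load instruction counts as one request.
        requests += 1
        # Every distinct 32 byte segment touched by the warp counts as one transaction.
        sectors = {(lane * stride * element_bytes) // SECTOR_BYTES for lane in lanes}
        transactions += len(sectors)
    return requests, transactions


requests, transactions = count_loads(4096, 8)
print(requests, transactions, transactions / requests)  # 128 1024 8.0
```

In this model, the same access pattern on 4 byte floats would give 4 transactions per request, and doubles read with a stride of 2 elements would give 16, so a value of 8 is what the model predicts for contiguous doubles.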

anroesti
  • I think your explanation is correct. Memory transactions happen in units of 32 bytes; see the [documentation](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory). The number of load requests equals the number of `ld.global.b64` instructions executed. – heapoverflow Mar 05 '20 at 16:31
  • Thanks very much for the documentation link, that clears up the transaction size of 32 bytes (which I just assumed because the numbers added up that way). For anyone else wondering, the relevant portion is in section 9.2.1 of the best practices guide. The same type of source for the request definition would be wonderful, but I'm happy with this knowledge for now. – anroesti Mar 06 '20 at 10:16
  • "In this scenario a 32 warps * 8 bytes = 256 byte load. -- Is this correct?" It looks like a typo. Is it "32 warps" or "32 threads"? – Gnimuc Mar 09 '20 at 01:56
  • @Gnimuc Yes, I think that is what I actually had in mind. I corrected it now. – anroesti Mar 10 '20 at 13:51

1 Answer

A memory "request" is an instruction which accesses memory, and a "transaction" is the movement of a unit of data between two regions of memory.

s.feng