
I am relatively new to CUDA programming.

In the blog post "How to Access Global Memory Efficiently in CUDA C/C++ Kernels", we have the following:

"The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size."

The 128-byte transaction is also mentioned in this post (The cost of CUDA global memory transactions).

In addition, 32- and 128-byte memory transactions are also mentioned in the CUDA C Programming Guide. This guide also shows Figure 20 about aligned and misaligned access, which I couldn't quite understand.


  1. Could you explain and give examples of how 32-, 64-, and 128-byte transactions happen?
  2. Could you go through Figure 20 in more detail? What point is the figure making?
Eduardo Reis
  • These basic concepts are covered in a variety of CUDA training materials. To get a solid understanding of the figures here, I would recommend the first 4 units of [this training series](https://www.olcf.ornl.gov/cuda-training-series/), where unit 4 is the one that specifically explains Figure 20 in more detail. Following that material will also give you a pretty good understanding of how "32-, 64-, or 128-byte transactions" come about. Otherwise you are asking for a large amount of introductory material to be covered to give a solid answer. – Robert Crovella May 06 '22 at 20:43
  • Hi Robert, thank you for pointing that out and also for reminding me of the training series. – Eduardo Reis May 06 '22 at 20:49

1 Answer


Both of these need to be understood in the context of a CUDA warp. All operations are issued warp-wide, and this includes instructions that access memory.

An individual CUDA thread can access 1, 2, 4, 8, or 16 bytes in a single instruction or transaction. Considered warp-wide, that translates to anywhere from 32 bytes up to 512 bytes. The GPU memory controller can typically issue requests to memory in granularities of 32 bytes, up to 128 bytes. Larger requests (say, 512 bytes, considered warp-wide) will get issued via multiple "transactions" of typically no more than 128 bytes.
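As a rough illustration (these toy kernels and their names are mine, purely for demonstration), each thread below loads one element of type T, so the per-thread access size is sizeof(T) and a fully active warp requests 32 * sizeof(T) bytes:

```cuda
#include <cuda_runtime.h>

// Each thread loads a single element, so the per-thread access size is
// sizeof(T) and a fully active warp of 32 threads requests 32 * sizeof(T) bytes.
__global__ void copy_char(const char *in, char *out)        // 1 B/thread  ->  32 B/warp
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void copy_int(const int *in, int *out)           // 4 B/thread  -> 128 B/warp
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void copy_float4(const float4 *in, float4 *out)  // 16 B/thread -> 512 B/warp
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // 512 bytes warp-wide, issued as multiple <=128-byte transactions
}
```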

Modern DRAM has the design characteristic that you don't typically ask for a single byte; you request a "segment", typically of 32 bytes at a time for typical GPU designs. The division of memory into segments is fixed at design time. As a result, you can request either the first 32 bytes (the first segment) or the second 32 bytes (the second segment). You cannot request bytes 16-47, for example. This is all a function of the DRAM design, but it manifests in terms of memory behavior.
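To make that segment arithmetic concrete, here is a hypothetical helper (the function is mine, purely illustrative): with fixed 32-byte segments, the segment an address falls in follows from integer division.

```cuda
#include <cstddef>

// Hypothetical helper: with fixed 32-byte segments, a byte address belongs
// to segment floor(addr / 32). Segment 0 covers bytes 0..31, segment 1
// covers bytes 32..63, and so on.
__host__ __device__ size_t segment_of(size_t byte_addr)
{
    return byte_addr / 32;
}
// segment_of(0)  == 0 and segment_of(31) == 0 -> bytes 0..31 are one segment
// segment_of(16) == 0 but segment_of(47) == 1 -> a request for bytes 16..47
//                                                would span two segments
```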

The diagrams depict the behavior of each thread in a warp. Individual threads are depicted by the gray/black arrows pointing upwards. Each arrow represents the request from one thread, and the arrow points to the relative location in memory that the thread would like to load or store.

The diagrams are presented in comparison to each other to show the effect of "alignment". Considered warp-wide, if all 32 threads are requesting bytes of data that belong to a single segment, the memory controller needs to retrieve only one segment to satisfy the request. This is arguably the most efficient possible behavior (and therefore the most efficient data organization and access pattern, considered warp-wide) for a single request (i.e. a single load or store instruction).

However, if the addresses emanating from each thread in the warp result in the pattern depicted in the second figure, the access is "unaligned", and even though you are effectively asking for a similar data "footprint", the lack of alignment to a single segment means the memory controller will need to retrieve two segments from memory to satisfy the request.
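A minimal sketch of this aligned-versus-misaligned contrast (my own toy kernel, in the spirit of the offset-copy experiment in the blog post cited in the question): with offset == 0 each warp's footprint lines up with segment boundaries, while with offset == 1 each warp straddles a boundary.

```cuda
#include <cuda_runtime.h>

// Toy "offset copy" kernel. sizeof(float) == 4, so a warp touches
// 32 * 4 = 128 bytes. With offset == 0 that footprint is aligned to
// 32-byte segments; with offset == 1 it straddles a segment boundary,
// so the memory controller must fetch one extra segment per warp.
__global__ void offset_copy(const float *in, float *out, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    out[i] = in[i];
}
// Launch note: allocate at least N + offset elements so the shifted
// indices stay in bounds, e.g.
//   offset_copy<<<N / 256, 256>>>(d_in, d_out, 1);
```

Timing such a kernel over a range of offsets makes the alignment penalty visible, though, as noted below, caches usually make the real penalty much smaller than a factor of two.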

That is the key point of understanding associated with the figure. But there is more to the story than that. Misaligned access is not necessarily as tragic (performance cut in half) as this might suggest. The GPU caches play a role here, when we consider these transactions not just in the context of a single warp, but across many warps.

To get a more complete and orderly treatment of these topics, I suggest working through dedicated training material. It's by no means the only option, but unit 4 of this training series covers the topic in more detail.

Robert Crovella
  • Thank you so much for the answer. When I first took a look at this picture, it was not clear that the data "footprint" itself was not the issue. But, after reading your response, it became very clear that the misalignment in the picture is caused by the scenario depicted for thread 31, which accesses the next memory block at 256. – Eduardo Reis May 09 '22 at 14:14
  • Now, I am still unsure about the first part of your answer, about the granularity of memory access. From the picture, I understand that the choice between 32- and 128-byte accesses is due to the caching configuration. However, what would be the case for 1-, 2-, 4-, 8-, 16-, or 64-byte transactions? Is it based on the type of the pointer I have, i.e. the size of the pointed-to type? So a short int would yield 2-byte, an int 4-byte, and a long int 8-byte transactions? Once again, thank you for sharing the CUDA Training Series. I am going through it and am watching the second module now. – Eduardo Reis May 09 '22 at 14:19
  • 32 bytes is the lowest granularity, even if all threads access the same 1 byte or only one thread is active. If 32 threads (= 1 warp) access contiguous memory locations of 1 byte each, you get 32 bytes; with 4 bytes each you get 128 bytes, and with 16 bytes each (e.g. float4) you get 512 bytes. As Robert said, those would be split into transactions of 128 bytes max. You could nevertheless get a very small performance gain. – Sebastian May 09 '22 at 14:45
  • @Sebastian, thank you so much! I hadn't realized that. Everything is clear to me now. It's taking a while to adopt the parallel mindset and to always remember the 32 threads working at the same time, but hopefully I will get there soon! – Eduardo Reis May 09 '22 at 15:06
  • You have the right mindset - trying to understand what is happening, and not just trying to get a 'working' program. Try out the Nsight Compute GUI with some small kernels. Then you see what is really happening. (See e.g. https://docs.nvidia.com/nsight-compute/ProfilingGuide/graphics/memory-chart-a100.png) – Sebastian May 09 '22 at 15:53