
I'm trying to optimize my memory-bound numerical simulation kernel in OpenCL/SYCL using local memory to allow data sharing between workitems, so that I can reduce redundant global memory traffic.

When there's little to no data dependency within local memory, populating local memory is simple: one subdivides the local memory's index space and assigns a portion to each workitem with a suitable index calculation, so that all workitems collectively load the data from global memory into the appropriate locations in local memory. A barrier then separates each workitem's loading phase from its compute phase. Communication happens only in the compute phase, and only implicitly because the workitems share memory; during the loading phase, no synchronization, communication, or even control logic is needed.
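
For reference, the dependency-free pattern looks roughly like this in OpenCL C (the tile size, kernel name, and the dummy computation are only illustrative):

    #define TILE 256   /* elements staged per workgroup; illustrative size */

    __kernel void simple_tiled(__global const float *in, __global float *out)
    {
        __local float tile[TILE];

        const size_t lid  = get_local_id(0);
        const size_t lsz  = get_local_size(0);
        const size_t base = get_group_id(0) * (size_t)TILE;

        /* Loading phase: each workitem stages a strided slice of the tile. */
        for (size_t i = lid; i < TILE; i += lsz)
            tile[i] = in[base + i];

        /* A single barrier separates the loading phase from the compute phase. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Compute phase: any workitem may now read any element of the tile. */
        for (size_t i = lid; i < TILE; i += lsz)
            out[base + i] = tile[i] * 2.0f;   /* placeholder computation */
    }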

However, my kernel has a non-trivial data dependency chain (working out the data dependency is itself an exercise in combinatorics), because this is the only way to allow a high degree of data reuse within the space constraint of local memory. The data to be loaded into local memory is subdivided into 2x2x2 blocks, and irregular data dependencies exist between these blocks. For example, to compute the next 3 blocks, you need the previous 1 block; to compute the next 6 blocks, you need the previous 3 blocks; to compute the next 10 blocks, you need the previous 6 blocks; to compute the next 15 blocks, you need the previous 10 blocks; and so on. These blocks need to be "mapped" to workitems on the fly.

For considerations that are beyond the scope of this question, my plan right now is to use a 1024-workitem group size to reach Occupancy 4 on AMD GCN while still being able to use 64 KiB of local memory. If each workitem is responsible for calculating a single point of a 2x2x2 block, then to fully saturate the GPU I must always keep 128 blocks in local memory at a time, so the number of points being worked on is 2x2x2x128 = 1024.

This requirement greatly complicates the logic of local memory management:

  1. The local memory must be used as a ring buffer: I need to keep loading new blocks into local memory as soon as I retire old blocks, which means manipulating several pointers and counters (a rough sketch of this state follows the list).

  2. Ideally, 128 blocks must be loaded into local memory at a time, but this is further restricted by data dependency.

  3. When loading and retiring blocks, the data dependencies between adjacent iterations must not be broken.
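
To make the bookkeeping concrete, here is a rough sketch of what the workgroup-wide ring-buffer state and one retire/refill cycle could look like. All names, sizes, kernel arguments, and the single-workitem update below are assumptions of mine, not a working implementation:

    #define NBLOCKS   128   /* 2x2x2 blocks resident in local memory at once */
    #define BLOCK_PTS 8     /* points per block */

    __kernel void ring_buffer_skeleton(__global const float *src,
                                       uint blocks_retired,  /* decided by the dependency plan */
                                       uint blocks_loaded)
    {
        __local float ring[NBLOCKS * BLOCK_PTS];  /* block storage */
        __local uint  head;                       /* oldest block still needed by the compute phase */
        __local uint  tail;                       /* one past the newest block, modulo NBLOCKS */

        if (get_local_id(0) == 0) { head = 0; tail = 0; }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* One retire/refill cycle, managed by a single workitem. */
        if (get_local_id(0) == 0) {
            head = (head + blocks_retired) % NBLOCKS;  /* retire consumed blocks  */
            tail = (tail + blocks_loaded)  % NBLOCKS;  /* reserve the freed slots */
        }
        barrier(CLK_LOCAL_MEM_FENCE);  /* make the updated pointers visible to everyone */

        /* All workitems cooperatively fill the newly reserved slots. */
        const uint first_new = (tail + NBLOCKS - blocks_loaded) % NBLOCKS;
        for (uint i = get_local_id(0); i < blocks_loaded * BLOCK_PTS; i += get_local_size(0))
            ring[(first_new * BLOCK_PTS + i) % (NBLOCKS * BLOCK_PTS)] = src[i]; /* illustrative indexing */

        barrier(CLK_LOCAL_MEM_FENCE);  /* loading done before the compute phase */
    }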

In this case, what is the best way to populate and manage local memory?

It's impossible to divide the blocks among the workitems in a regular and communication-free manner. A common solution I've seen in tutorials is using if/else so that only a single workitem is responsible for loading local memory or managing the state of the kernel. However, I feel that this greatly reduces the available global memory bandwidth, since only a single workitem (e.g. if (id == 0)) is making memory requests, which is insufficient to saturate the memory controller. It also introduces workitem divergence, which can be harmful to performance. A related solution is dividing the work at a wavefront boundary (e.g. if (id < 64)), which should perform better.
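
For illustration, the two patterns look roughly like this (the bodies are placeholders, and 64 is the GCN wavefront size):

    __kernel void state_management_patterns(void)
    {
        /* (a) Single-workitem management: simple, but only one workitem issues
               memory requests, and the first wavefront diverges.              */
        if (get_local_id(0) == 0) {
            /* ... update ring-buffer pointers / counters, issue loads ... */
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* (b) Split at a wavefront boundary: the first 64 workitems (one GCN
               wavefront) do the loading/management, so there is no divergence
               inside a wavefront and more memory requests are in flight.      */
        if (get_local_id(0) < 64) {
            /* ... each of the 64 workitems loads a strided slice of the new blocks ... */
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }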

In summary, my question is: what are the general strategies for managing a mutable global state or a global data structure shared by all workitems in a workgroup?

比尔盖子
  • I can't say I fully understand your data dependency situation. If you can calculate the next 3 blocks knowing the previous block, then surely you never need to know any more than that. I suggest you create a minimal example which is analogous to your situation and ask for help optimising it. – Simon Goater Aug 16 '23 at 09:50
  • @SimonGoater I'm more interested in learning about general strategies for managing a global state shared by all workitems, regardless of the detail of the algorithm. The unique problem of GPU programming is that the kernel runs from the perspective of individual workitems, so it's unclear about how the global state of the entire workgroup should be managed - it ought to be a common pattern and I'm sure a lot of solutions have been developed for that, but I'm unable to find any clear description. – 比尔盖子 Aug 16 '23 at 10:04
  • @SimonGoater Speaking of my algorithm in particular, imagine a dependency tree with many layers, where the next layer depends on the previous layer. The tree gradually grows and shrinks. A naive solution is storing the entire tree (all data blocks), but it uses too much memory. A better solution is using a buffer only as wide as the two longest adjacent layers. But the problem is that, at the beginning of the computation, there are only a few blocks to handle, wasting hardware resources. The solution is a ring buffer, thus the question is how to manage the global state of that buffer. – 比尔盖子 Aug 16 '23 at 10:11
  • I think if you can't keep the dependencies within the work items, then you would have to come up with a bespoke solution, if that is even possible. There are some things GPUs are good at, and some things they are not good at. You may have found one of the latter. As you mentioned, there are barriers, but making use of them efficiently might not be possible in this case. It all depends on the detail. – Simon Goater Aug 16 '23 at 10:23
  • @SimonGoater The problem is in fact embarrassingly parallel if one allows unrestricted access to global memory and local memory. My kernel already saw a 1000% speedup over the CPU just from the GPU's massive memory bandwidth alone. But this formulation still generates too much redundant load/store traffic, and also uses too much workgroup local memory. The minimum-memory version of the algorithm reduces the local memory requirement to 1%, at the cost of turning a single-pass algorithm into a multi-pass one with dependencies on the previous pass. – 比尔盖子 Aug 16 '23 at 10:31
  • @SimonGoater The best solution I'm seeing now, after rethinking the problem, is pre-planning. Since the size and dependency of the blocks are fixed, all the ring buffer pointer offsets and the memory loads/stores performed by workitems can be pre-calculated, instead of being managed as a global state at runtime. If this approach works, the final kernel would contain either unrolled loops or lookup tables, with no complicated logic. – 比尔盖子 Aug 16 '23 at 10:34
  • I guess you must be using a Vega or Instinct GPU with HBM2.0 memory then? It's interesting to hear that the bandwidth isn't just hype. – Simon Goater Aug 16 '23 at 10:38
  • @SimonGoater Yep, Radeon VII / Instinct MI50 / Vega20. I found its 4096-bit HBM2 memory is great for memory-bound, data-streaming type kernels with a lot of DRAM I/O but very little computation per iteration. This kind of workload is uncommon in graphics and does not benefit too many applications, but it's the backbone of many physics simulation codes. – 比尔盖子 Aug 16 '23 at 10:49

2 Answers


After rethinking the problem, the best solution I'm seeing right now is pre-planning. Since a workgroup only processes a fixed number of blocks, with a fixed size and dependency chain, perhaps the actions performed by each workitem can be pre-calculated by the kernel programmer during coding or before kernel launch, instead of asking the workgroup to figure that out at runtime.

Using this method, all the ring buffer manipulations, including the pointer updates and the memory loads/stores performed by each workitem, are pre-determined. If this approach works, each workitem in the final kernel would follow its pre-planned path using either unrolled loops or lookup tables, without any other complicated logic to maintain the global workgroup state. Only a few barriers between block retirements and ring buffer flushes are needed.
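
To illustrate the idea, here is a rough sketch of the shape such a pre-planned kernel could take. The table layout, names, step count, and indexing are placeholders for illustration, not the actual plan:

    #define NSTEPS  16     /* illustrative number of load/compute rounds        */
    #define LDS_PTS 1024   /* points resident in local memory (128 blocks of 8) */

    typedef struct {
        int src_offset;   /* offset into the group's global tile, or -1 = no load */
        int dst_slot;     /* destination index in local memory                    */
    } plan_entry_t;

    __kernel void preplanned(__global const float *in,
                             __global float *out,
                             __constant plan_entry_t *plan) /* NSTEPS * group-size entries */
    {
        __local float lds[LDS_PTS];
        const size_t lid  = get_local_id(0);
        const size_t lsz  = get_local_size(0);
        const size_t base = get_group_id(0) * LDS_PTS;  /* this group's global tile */

        for (int step = 0; step < NSTEPS; ++step) {
            /* Each workitem just does what the table says for (step, lid);
               no runtime ring-buffer bookkeeping is left in the kernel.    */
            plan_entry_t e = plan[step * lsz + lid];
            if (e.src_offset >= 0)
                lds[e.dst_slot] = in[base + e.src_offset];

            barrier(CLK_LOCAL_MEM_FENCE);   /* new blocks are now visible            */
            /* ... compute this workitem's point from the blocks it depends on ...   */
            barrier(CLK_LOCAL_MEM_FENCE);   /* retire blocks before they are reused  */
        }

        out[get_global_id(0)] = lds[lid];   /* placeholder writeback */
    }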

比尔盖子

Sort the workitems on their dependencies. When they are sorted, they can use the cache efficiently, making redundant loading quick since they belong to the same compute unit (and share its L1 cache).

For example, if you are building a neighbor list of particles in a volumetric computation, you can sort the workitems on their neighbor-id values so that, after sorting, they access particles with similar id values at the same time. This is good for L1/L2 caching, and you don't have to use local memory explicitly.
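
A minimal sketch of the idea (the sorted index array is assumed to be prepared beforehand, e.g. on the host or by a separate sort kernel, and all names are illustrative):

    /* Gather through a pre-sorted index array: workitems that are adjacent in
       the dispatch touch nearby particle ids, so their global loads hit the
       same cache lines instead of being scattered.                            */
    __kernel void gather_sorted(__global const float4 *particles,   /* particle data              */
                                __global const int    *sorted_ids,  /* ids pre-sorted by locality */
                                __global float4       *out)
    {
        const size_t gid = get_global_id(0);
        const int id = sorted_ids[gid];   /* neighbouring workitems get neighbouring ids   */
        out[gid] = particles[id];         /* placeholder: a real kernel would compute here */
    }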

huseyin tugrul buyukisik
  • On AMD GCN, L1 cache is 16 KiB per workgroup but LDS is 64 KiB per workgroup, I should be able to achieve better performance using local memory explicitly. – 比尔盖子 Aug 18 '23 at 20:38
  • Could you write some pseudo-code so that it becomes easier to guess the algorithm? – huseyin tugrul buyukisik Aug 18 '23 at 20:41
  • I'm already working on a script to "compile" (so to speak) a static lookup table to control the local memory load/store for each workitem based on data dependency, so this way I can eliminate the complex runtime buffer management logic. Time will tell if this idea really works. – 比尔盖子 Aug 18 '23 at 20:45