2

I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - more murky than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within GPUs, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).

The intent of OpenACC and CUDA is to abstract the hardware away from the parallel programmer, constraining her to a three-level programming model with gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). My understanding is that the CUDA model aims to offer scalability with respect to its thread blocks, which are independent of one another and are mapped to SMXs. I therefore see two ways to investigate strong scaling on the GPU:

  1. Fix the problem size and the number of threads per block (at an arbitrary constant), and scale the number of thread blocks (the grid size).
  2. Given additional knowledge of the underlying hardware (e.g. CUDA compute capability, max warps per multiprocessor, max thread blocks per multiprocessor, etc.), choose the number of threads per block such that a single block occupies an entire SMX by itself. Scaling over thread blocks is then equivalent to scaling over SMXs.

My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?

lodhb

2 Answers

5

GPUs do strong scale, but not necessarily in the way that you're thinking, which is why you've only been able to find information about strong scaling to multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that a well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.

You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there are a few problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter what the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that you can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.

If you want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do that with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're really better off using parallel, since that's the more prescriptive directive. What you'd want to do is something like the following:

#pragma acc parallel loop gang vector num_gangs(G /* vary this number */) vector_length(V /* fix this number */)
for (int i = 0; i < N; i++) {
  /* do something with element i */
}

This tells the compiler to vectorize the loop using some provided vector length and then partition the loop across OpenACC gangs. What I'd expect is that as you add gangs you'll see better performance up until some multiple of the number of SMs, at which point performance would become roughly flat (with outliers, of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, you'll have more portable code by either letting the compiler decide how to decompose the loop, which lets it make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or to multiple GPUs.
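
For concreteness, here is a minimal sketch of that kind of experiment, assuming a C/OpenACC compiler such as PGI's. The SAXPY-like loop body, the problem size, the vector length of 128, and the range of gang counts are arbitrary illustrative choices, not recommendations.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const int N = 1 << 24;
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Keep the arrays resident on the device for the whole sweep so that
       host/device transfers do not dominate the timings. */
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        for (int gangs = 1; gangs <= 1024; gangs *= 2) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);

            /* Fixed problem size and vector length; only the gang count varies. */
            #pragma acc parallel loop gang vector num_gangs(gangs) vector_length(128) \
                    present(x[0:N], y[0:N])
            for (int i = 0; i < N; i++)
                y[i] = 2.0f * x[i] + y[i];

            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                      + (t1.tv_nsec - t0.tv_nsec) * 1e-6;
            printf("gangs=%5d  time=%8.3f ms\n", gangs, ms);
        }
    }

    free(x);
    free(y);
    return 0;
}

Plotting these times against the gang count should show the roughly flat region beyond some multiple of the SM count described above.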

jefflarkin
  • Excellent answer, thanks! Is there a reason why you do not use the `worker` and `num_workers` clauses? – lodhb Feb 26 '15 at 21:21
  • If there's sufficient vector parallelism on the innermost loop, then `worker` is generally not necessary on Nvidia GPUs. The cases where I've seen `worker` be useful are when the innermost loop doesn't contain very much parallelism, so adding worker parallelism multiplies the parallelism available to fill out the threadblock. – jefflarkin Mar 02 '15 at 15:33
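
As a rough illustration of that last comment, here is a minimal sketch of a loop nest where the innermost loop has too little parallelism to fill a threadblock on its own, so worker parallelism is added on the outer loop. The array shapes and the num_workers(4)/vector_length(32) values are illustrative assumptions only.

#include <stdlib.h>

#define N 100000
#define M 32   /* short inner dimension: not much vector parallelism */

int main(void)
{
    float (*a)[M] = malloc(sizeof(float[N][M]));
    float (*b)[M] = malloc(sizeof(float[N][M]));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            b[i][j] = 1.0f;

    /* The outer loop is split across gangs and workers; the short inner
       loop maps to the vector lanes.  With 4 workers of 32 vector lanes,
       each gang becomes a 128-thread block instead of a mostly idle one. */
    #pragma acc parallel loop gang worker num_workers(4) vector_length(32) \
            copyin(b[0:N]) copyout(a[0:N])
    for (int i = 0; i < N; i++) {
        #pragma acc loop vector
        for (int j = 0; j < M; j++)
            a[i][j] = 2.0f * b[i][j];
    }

    free(a);
    free(b);
    return 0;
}
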
0

For occupying a complete SMX I would suggest using shared memory as the limiting resource for occupancy. Write a kernel that consumes all 32 kB of shared memory and the block will occupy the entire SMX, because the SMX is then out of resources for another block. Then you can scale the number of blocks from 1 to 13 (for a K20c) and the scheduler will (hopefully) place each block on a different SMX. Then you can scale the threads per block, first up to 192 to get each CUDA core busy, and then further to keep the warp schedulers happy. GPUs provide performance through latency hiding, so eventually you have to move on from 1 block per SMX to N blocks per SMX. You can do that by using less shared memory, and then again scale up your warps to help with latency hiding.
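
A minimal CUDA sketch of this idea follows. It is only an illustration under the assumptions noted in the comments: the 32 kB static shared-memory array exists purely to limit occupancy to one resident block per SMX, the 13-block sweep matches a K20c, and the kernel body is trivial placeholder work.

#include <cstdio>

// Illustrative kernel: the large static shared-memory allocation is used
// purely to limit occupancy so that only one block can be resident per SMX.
__global__ void occupy_one_block_per_smx(float *out, int n)
{
    __shared__ float pad[8192];               // 8192 floats = 32 kB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    pad[threadIdx.x] = (float)i;              // touch shared memory so it is not optimized away
    __syncthreads();
    if (i < n)
        out[i] = 2.0f * pad[threadIdx.x];     // trivial placeholder work
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // Sweep the grid from 1 block up to 13 (the number of SMXs on a K20c),
    // keeping the block size fixed, and time each configuration.
    for (int blocks = 1; blocks <= 13; ++blocks) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        occupy_one_block_per_smx<<<blocks, 192>>>(d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blocks=%2d  time=%.3f ms\n", blocks, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    cudaFree(d_out);
    return 0;
}

Once this one-block-per-SMX sweep flattens out, shrinking the shared-memory array lets several blocks share an SMX, which is where the latency hiding described above starts to pay off.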

I have never touched OpenACC, but if you really want full control over your experimental code, use CUDA instead of OpenACC. You cannot see inside the OpenACC compiler or know what it is doing with the pragmas used in your code.

Michael Haidl
  • "You cannot see inside the OpenACC compiler and what it is doing with the pragmas used in your code." With the PGI OpenACC compiler, you can specify to keep intermediate files produced. The PGI OpenACC compilers convert the relevant source code to intermediate CUDA C/C++ files. If you keep these, you can see *exactly* what CUDA kernels are being generated. OpenACC is a richly featured specification, and gives you control of the GPU approaching the level you can achieve in CUDA. – Robert Crovella Nov 15 '14 at 22:29