I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is, at least as far as I know, murkier than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs (a sketch of that setup follows below). However, I believe there is some amount of strong scaling within a single GPU, for example scaling over streaming multiprocessors (SMXs, in the Nvidia Kepler architecture).
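For reference, the setup from that resource would, as I understand it, look roughly like this. This is a hedged sketch: the saxpy-style loop and all names are mine, standing in for my real kernel.

```c
#include <openacc.h>

/* Classic strong scaling: fixed total size n, split evenly across
 * however many GPUs are visible, launched asynchronously so the
 * devices actually run concurrently. */
void multi_gpu(int n, float a, const float *restrict x, float *restrict y)
{
    int ndev = acc_get_num_devices(acc_device_nvidia);
    for (int d = 0; d < ndev; ++d) {
        acc_set_device_num(d, acc_device_nvidia);
        int lo  = d * n / ndev;
        int len = (d + 1) * n / ndev - lo;
        #pragma acc parallel loop async copyin(x[lo:len]) copy(y[lo:len])
        for (int i = lo; i < lo + len; ++i)
            y[i] = a * x[i] + y[i];
    }
    /* wait on each device's async queue before using the results */
    for (int d = 0; d < ndev; ++d) {
        acc_set_device_num(d, acc_device_nvidia);
        #pragma acc wait
    }
}
```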
The intent of OpenACC (and of CUDA underneath it) is to abstract the hardware away from the parallel programmer, constraining her to a three-level programming model: in OpenACC terms, gangs, workers, and vectors, which on Nvidia hardware typically map to thread blocks, warps, and SIMT groups of threads, respectively. It is my understanding that the CUDA model aims to offer scalability with respect to its thread blocks, which are independent of one another and are mapped to SMXs. I therefore see two ways to investigate strong scaling within the GPU (sketches of both follow the list):
- Fix the problem size and set the thread block size (i.e. the number of threads per block) to an arbitrary constant. Scale the number of thread blocks (grid size).
- Given additional knowledge of the underlying hardware (e.g. CUDA compute capability, max warps per multiprocessor, max thread blocks per multiprocessor, etc.), choose the thread block size such that a single block occupies an entire SMX. Scaling over thread blocks is then equivalent to scaling over SMXs.
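To make both options concrete, here is a hedged sketch of how I would express them in OpenACC. The saxpy-style loop stands in for my real kernel, and the function names and the particular vector lengths are my choices, not anything the spec mandates; I am also not sure OpenACC guarantees the gang-to-SMX mapping I am hoping for in the second function.

```c
#include <openacc.h>
#include <cuda_runtime.h>  /* only for the SMX-count query below */

/* Approach #1: fixed problem size n, fixed vector length (threads
 * per gang); sweep `gangs` across runs: 1, 2, 4, ... */
void scale_gangs(int n, int gangs, float a,
                 const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop num_gangs(gangs) vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Approach #2: make each gang as large as one thread block can be on
 * Kepler (1024 threads), hoping one gang lands on one SMX, and sweep
 * `gangs` up to the SMX count. */
void scale_smx(int n, int gangs, float a,
               const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop num_gangs(gangs) vector_length(1024) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* OpenACC itself does not expose the multiprocessor count, so I fall
 * back on the CUDA runtime here (assuming the OpenACC compiler lets
 * me link against it). */
int smx_count(void)
{
    int dev = 0, n = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&n, cudaDevAttrMultiProcessorCount, dev);
    return n;  /* e.g. 13 on a K20 */
}
```

One thing that already worries me about #2: on Kepler a thread block is capped at 1024 threads while an SMX can host 2048 resident threads, so a single gang can at best half-fill an SMX's thread slots.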
My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC itself, short of dropping down to the CUDA runtime as in my sketch?