I am working with OpenACC using the PGI compiler. How can I profile the code's memory usage, in particular the shared memory, at runtime?

Thank you so much for your help!

Behzad

behzad baghapour
  • Which shared memory do you mean? The [OpenACC specification](http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf) primarily uses "shared memory" to distinguish between two types of accelerators: 1. accelerators for which the host and device memory is shared, called a "shared memory accelerator"; 2. accelerators for which the host and device memory are separate, a "non-shared memory accelerator". Or do you mean "shared memory" as it is defined by CUDA? Or perhaps you mean OpenCL 2.0 Shared Virtual Memory? – Robert Crovella Aug 15 '15 at 13:43
  • Thanks for your response. I mean shared memory in the CUDA sense, the kind that is cached in OpenACC (`!$acc cache`). – behzad baghapour Aug 18 '15 at 17:02

1 Answer


I'm assuming you mean "shared memory" in the CUDA sense (the fast, per-SM shared memory on NVIDIA GPUs). In this case, you have a few options.

First, if you just want to know how much shared memory is being used, you can determine this at compile time by adding `-Mcuda=ptxinfo`:

pgcc -fast -ta=tesla:cc35 laplace2d.c -Mcuda=ptxinfo
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_61_gpu' for 'sm_35'
ptxas info    : Function properties for main_61_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 26 registers, 368 bytes cmem[0]
ptxas info    : Compiling entry function 'main_65_gpu_red' for 'sm_35'
ptxas info    : Function properties for main_65_gpu_red
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 368 bytes cmem[0]
ptxas info    : Compiling entry function 'main_72_gpu' for 'sm_35'
ptxas info    : Function properties for main_72_gpu
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 344 bytes cmem[0]

In the above case, it doesn't appear that I'm using any shared memory. (Follow-up: I spoke with a PGI compiler engineer and learned that the shared memory is adjusted dynamically at kernel launch, so it will not show up via ptxinfo.)

You can also use the NVIDIA Visual Profiler to get at this information. If you gather a GPU timeline and then click on an instance of a particular kernel, the properties panel should open and show the shared memory per block. In my case, ptxinfo reported 0 bytes of shared memory while the Visual Profiler showed some in use; the follow-up above explains the discrepancy (the shared memory is set dynamically at kernel launch).

You can get some info at runtime too. If you're comfortable on the command-line, you can use nvprof:

# Analyze load/store transactions
$ nvprof -m shared_load_transactions,shared_store_transactions ./a.out
# Analyze shared memory efficiency
# This will result in a LOT of kernel replays.
$ nvprof -m shared_efficiency ./a.out

This doesn't show the amount used, but does give you an idea of how it's used. The Visual Profiler's guided analysis will give you some insight into what these metrics mean.
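
If you do want the actual amount of shared memory per kernel launch at runtime, nvprof's GPU trace should report it as well; the SSMem and DSMem columns list the static and dynamic shared memory for each launch (a sketch, assuming a CUDA toolkit recent enough to ship nvprof):

# One line per kernel launch; the SSMem/DSMem columns show the
# static and dynamic shared memory used by each launch.
$ nvprof --print-gpu-trace ./a.out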

jefflarkin
  • Thanks a lot Jeff for the detailed information. I checked the code and there is no shared memory (cache) usage. However, when I add `!$acc cache` for one of the vectors in the inner loop, performance degrades (the computational time increases and GFLOPs decrease). I know how to implement shared memory in CUDA, but I have no idea how to implement it efficiently in OpenACC. – behzad baghapour Aug 18 '15 at 17:08
  • One thing I've seen at times is that if I'm using the `kernels` directive the compiler will sometimes automatically add caching to the code (you can see it in the `-Minfo` messages), but if you use `parallel loop` you'll generally have to use the `cache` directive explicitly. If you have arrays that are private within `gang` loops, you're more likely to get the arrays cached (and more likely to be able to get the `cache` directive working). – jefflarkin Aug 18 '15 at 17:11
  • If you have a CUDA kernel in mind when you're writing OpenACC, especially when you're wanting to implement these sorts of lower-level optimizations, it can be very frustrating. See my earlier comment about tricks to getting `cache` working. Optimization with OpenACC can be a bit hand-wavy at times, with a lot of guessing and finger crossing. We're working on making that better. – jefflarkin Aug 18 '15 at 17:14
  • Thanks Jeff. I am actually looking for some documentation to help me use shared memory explicitly (manually) in an OpenACC code. I would really appreciate it if you could point me to some resources. – behzad baghapour Aug 19 '15 at 20:57
  • You're right, the cache directive is not very well documented and can be mysterious to use; a minimal sketch follows these comments. I'll see if I can find someone to help you in your follow-up question. – jefflarkin Aug 20 '15 at 16:21
  • I really appreciate your help. – behzad baghapour Aug 21 '15 at 13:30
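
Since the comments above ask how to use the `cache` directive explicitly, here is a minimal sketch in C (the stencil, and the names `a`, `b`, and `n`, are hypothetical and not from the original code). Per the OpenACC spec, the `cache` directive appears at the top of a loop body and names the array section to stage in the fastest available memory, which is CUDA shared memory on NVIDIA targets:

// Minimal sketch: stage a 3-element window of the read-only array `a`
// in fast on-chip memory while computing a simple 1D stencil into `b`.
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 1; i < n - 1; ++i) {
    #pragma acc cache(a[i-1:3])   // C section syntax is [start:length]
    b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0f;
}

Whether this section actually lands in shared memory, and whether it helps performance, is up to the compiler; check the `-Minfo` messages to see what the compiler did with it, as discussed in the comments above.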