
I'm trying to use shared memory to cache things with OpenACC.

Basically what I'm working on is a matrix multiplication, and what I have is this:

typedef float ff;

// Multiplies two square matrices a and b (stored column-major, as the
// indexing below implies), puts the result in c.
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{
    #pragma acc region
    {
        #pragma acc loop independent vector(16)
        for (int i = 0; i < n; ++i) {
            #pragma acc loop independent vector(16)
            for (int j = 0; j < n; ++j) {
                ff sum = 0;
                for (int k = 0; k < n; ++k) {
                    sum += a[i + n * k] * b[k + n * j];
                }
                c[i + n * j] = sum;
            }
        }
    }
}
}

What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.

Basically, in CUDA I would know the exact size of my blocks, and would be able to (see the sketch after this list):

  • declare a shared memory with the size of the block
  • copy the 'relevant' part of the data to the block
  • use this data
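
For reference, here is a minimal sketch of the CUDA pattern I have in mind (illustrative only: it assumes row-major storage, n divisible by the tile size, and names I made up):

#define TILE 16

// Classic shared-memory tiled matrix multiply: each block stages one
// TILE x TILE tile of 'a' and one of 'b' in shared memory, then every
// thread accumulates its partial dot product from the staged tiles.
__global__ void mmul_tiled(const float* a, const float* b, float* c, int n) {
    __shared__ float as[TILE][TILE];
    __shared__ float bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Copy the 'relevant' part of the data into shared memory.
        as[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        bs[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Use the staged tiles for this step of the dot product.
        for (int k = 0; k < TILE; ++k)
            sum += as[threadIdx.y][k] * bs[k][threadIdx.x];
        __syncthreads();
    }
    c[row * n + col] = sum;
}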

I understand I can use the

#pragma acc cache

directive, and that I can specify block sizes with the vector and gang clauses, but I'm having some trouble understanding how that gets mapped to the CUDA architecture.
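
For instance, the usage pattern I have seen in examples looks like this (an illustrative 1-D smoothing loop, not my matrix code; whether the compiler actually backs the request with shared memory seems to be implementation-dependent):

typedef float ff;

// Sketch of the cache directive as I understand it: it is placed at the
// top of a loop body and names the array section to keep in fast memory.
void smooth(const ff* restrict in, ff* restrict out, const int n) {
#pragma acc data copyin(in[0:n]) copyout(out[0:n])
{
    #pragma acc region
    {
        #pragma acc loop independent vector(16)
        for (int i = 1; i < n - 1; ++i) {
            // Ask for in[i-1..i+1] to live in on-chip memory for this
            // iteration; honoring the request is up to the compiler.
            #pragma acc cache(in[i-1:3])
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        }
    }
}
}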

Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive, or on how to map some of the power of CUDA's shared memory to OpenACC?

  • The PGI accelerator compiler may be using shared memory already. Have you inspected the output with the -Minfo switch? This [tutorial](http://developer.nvidia.com/cuda/openacc-example-part-1) may be of interest. – Robert Crovella Oct 17 '12 at 05:06
  • 1
  • Yes, but the -Minfo switch only tells me HOW much shared memory my implementation is using. While this is useful, I was more interested in knowing whether there's a way of **explicitly** manipulating such memory. Being able to see the high-level CUDA that is generated is very helpful, though. – leo Oct 17 '12 at 18:43
  • @leo did you find an answer to your question? Were you able to explicitly define shared memory in OpenACC? – mgNobody May 09 '17 at 04:07

1 Answer


If you are using the PGI Accelerator compiler, you can dump out the generated PTX file and see what is going on under the hood during execution:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult

The generated PTX will be stored in the current directory.

EDIT: You may prefer to see the high-level code (CUDA C or Fortran). In that case, use the following: -ta=nvidia,cc13,keepptx,keepgpu
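
For example, since .shared is the PTX state space used for shared memory, grepping the kept PTX shows whether the compiler actually placed anything there (the generated file names vary by compiler version):

grep -n "\.shared" *.ptx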
