
I am currently trying to implement a simple matrix multiplication of two n×n matrices using OpenMP target offloading. The code is taken from here:

```cpp
template<typename T>
void multiplyJIK(T *A, T *B, T *C, uint64_t size) {

    #pragma omp target data device(0) map(to: A[0:size*size], B[0:size*size], size) map(tofrom: C[0:size*size])
    {
        #pragma omp target teams device(0) num_teams(32768) thread_limit(512) \
            map(to: A[0:size*size], B[0:size*size], size) map(tofrom: C[0:size*size]) \
            default(none) shared(A, B, C, size)

        #pragma omp distribute parallel for num_threads(512) dist_schedule(static, 512) \
            default(none) shared(A, B, C, size)

        for (uint64_t j = 0; j < size; ++j) {
            for (uint64_t i = 0; i < size; ++i) {
                for (uint64_t k = 0; k < size; ++k) {
                    C[i * size + j] += A[i * size + k] * B[k * size + j];
                }
            }
        }
    }
}
```

It should multiply the two matrices A and B and store the result in C. The matrices are represented as one-dimensional arrays of length size * size.

For my test, `T` is `float`, and I compile the code using the NVHPC toolkit with `nvc++ -std=c++17 -mp=gpu -target=gpu main.cpp -o matmul`, which produces this error:

```
error: item must appear in a SHARED or PRIVATE clause:
                          C[i * size + j] += A[i * size + k] * B[k * size + j];
                          ^
       detected during instantiation of "void Target::multiplyJIK(T *, T *, T *, uint64_t) [with T=float]"
```

I don't understand this error, since the C array should be correctly mapped (`map(tofrom: C...)`) and is present in the `shared(...)` clause. Am I missing something in the code, or is this a problem with the compile flags?

  • You are trying to multiply a matrix using 512 threads. That would mean at a size of 8 each thread does a single multiplication. I would assume the size is significantly larger or multithreading just wouldn't be worth it. But then why are you using an `O(n^3)` matrix multiplication instead of one of the more efficient algorithms? – Goswin von Brederlow Jun 06 '22 at 13:46
  • @GoswinvonBrederlow there are indeed more efficient algorithms but currently I am just trying to get this slower version to work. My test matrices are 4096*4096 so there is enough data to work with. But all of this should have no impact on my compile error. – Dogyman Jun 06 '22 at 13:57
  • IDK about the error, but why do you want to do that? The code is very, very inefficient for GPUs (it is already pretty inefficient for CPUs), starting with the unneeded transfer of `C` and the very SIMD-unfriendly operations. There are BLAS libraries for that, and there are GPU variants of BLAS (like cuBLAS) which are heavily optimized. Doing this efficiently on a GPU is far from simple (one should use tiling, shared memory, and a register-blocking approach). – Jérôme Richard Jun 06 '22 at 14:05

0 Answers