
I am trying to distribute the work of multiplying two NxN matrices across 3 nVidia GPUs using 3 OpenMP threads. (The matrix values will get large, hence the long long data type.) However, I am having trouble placing the #pragma acc parallel loop in the correct place. I have used some examples from the nVidia PDFs shared, but with no luck. I know that the innermost loop cannot be parallelized, but I would like each of the three threads to own a GPU and do a portion of the work. Note that the input and output matrices are defined as global variables, as I kept running out of stack memory.

I have tried the code below, but I get compilation errors, all pointing to line 75, which is the #pragma acc parallel loop line:

[test@server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c

PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)

PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)

PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)

PGC/x86-64 Linux 18.10-1: compilation completed with severe errors

Function is:

void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
    // Get Nvidia device type
    acc_init(acc_device_nvidia);

    // Get Number of GPUs in system
    int num_gpus = acc_get_num_devices(acc_device_nvidia);

    //Set the number of OpenMP threads to the number of GPUs
    #pragma omp parallel num_threads(num_gpus)
    {
        //Get the OpenMP thread number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
        acc_set_device_num(threadNum, acc_device_nvidia);

        int row;
        int col;
        int key;

        #pragma omp for
        #pragma acc parallel loop
        for (row = 0; row < SIZE; row++)
            for (col = 0; col < SIZE; col++)
                for (key = 0; key < SIZE; key++)
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
    }
}
#pragma omp for AND #pragma acc parallel loop each have to be tightly bound to the following for loop. That said, you may not need #pragma omp for at all: you already have the omp parallel threads and the GPU-to-thread assignment. Compare with these little old tutorial slides: https://www.hkhlr.de/sites/default/files/field_download_file/2013_April_11_12_wienke.pdf – fisehara Apr 20 '19 at 20:19

2 Answers


As fisehara points out, you can't combine an OpenMP "for" directive and an OpenACC parallel loop on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.

Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit on a single GPU, so there's no need for the extra overhead of introducing host-side parallelization.
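
For comparison, here's a minimal single-GPU sketch (my illustration, not code from the question): the whole product is offloaded in one OpenACC region, with no OpenMP at all. The function name multiplyMatrixSingleGPU is hypothetical; SIZE and the long long matrices match the code below.

// Single-GPU sketch: offload the entire product in one OpenACC region.
void multiplyMatrixSingleGPU(long long int matrixA[SIZE][SIZE],
                             long long int matrixB[SIZE][SIZE],
                             long long int matrixProduct[SIZE][SIZE])
{
    int row, col, key;
    #pragma acc parallel loop collapse(2) \
      copyin(matrixA[:SIZE][:SIZE], matrixB[:SIZE][:SIZE]) \
      copy(matrixProduct[:SIZE][:SIZE])
    for (row = 0; row < SIZE; row++) {
        for (col = 0; col < SIZE; col++) {
            long long int sum = 0;
            // The reduction loop stays sequential per (row, col) element
            for (key = 0; key < SIZE; key++) {
                sum += matrixA[row][key] * matrixB[key][col];
            }
            matrixProduct[row][col] += sum;
        }
    }
}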

Also, I generally recommend using MPI+OpenACC for multi-GPU programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. Also, MPI gives you a one-to-one relationship between the host process and the accelerator, allows for scaling beyond a single node, and lets you take advantage of CUDA-aware MPI for direct GPU-to-GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource.

% cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#define SIZE 130

void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{

#ifdef _OPENACC
    // Get Nvidia device type
    acc_init(acc_device_nvidia);
    // Get Number of GPUs in system
    int num_gpus = acc_get_num_devices(acc_device_nvidia);
#else
    int num_gpus = omp_get_max_threads();
#endif
    if (SIZE<num_gpus) {
        num_gpus=SIZE;
    }
    printf("Num Threads: %d\n",num_gpus);

    //Set the number of OpenMP thread to the number of GPUs
    #pragma omp parallel num_threads(num_gpus)
    {
        //Get thread openMP number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
#ifdef _OPENACC
        acc_set_device_num(threadNum, acc_device_nvidia);
        printf("THID %d using GPU: %d\n",threadNum,threadNum);
#endif
        int row;
        int col;
        int key;
        int start, end;
        int block_size;
        block_size = SIZE/num_gpus;
        start = threadNum*block_size;
        end = start+block_size;
        if (threadNum==(num_gpus-1)) {
           // add the residual to the last thread
           end = SIZE;
        }
        printf("THID: %d, Start: %d End: %d\n",threadNum,start,end-1);

        #pragma acc parallel loop \
          copy(matrixProduct[start:end-start][:SIZE]), \
          copyin(matrixA[start:end-start][:SIZE],matrixB[:SIZE][:SIZE])
        for (row = start; row < end; row++) {
            #pragma acc loop vector
            for (col = 0; col < SIZE; col++) {
                for (key = 0; key < SIZE; key++) {
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
        }}}
    }
}

int main() {
   long long int matrixA[SIZE][SIZE];
   long long int matrixB[SIZE][SIZE];
   long long int matrixProduct[SIZE][SIZE];
   int i,j;
   for(i=0;i<SIZE;++i) {
     for(j=0;j<SIZE;++j) {
        matrixA[i][j] = (i*SIZE)+j;
        matrixB[i][j] = (j*SIZE)+i;
        matrixProduct[i][j]=0;
     }
   }
   multiplyMatrix(matrixA,matrixB,matrixProduct);
   printf("Result:\n");
   for(i=0;i<SIZE;++i) {
      printf("%d: %ld %ld\n",i,matrixProduct[i][0],matrixProduct[i][SIZE-1]);
   }

}
% pgcc test.c -mp -ta=tesla -Minfo=accel,mp
multiplyMatrix:
     28, Parallel region activated
     49, Generating copyin(matrixB[:130][:])
         Generating copy(matrixProduct[start:end-start][:131])
         Generating copyin(matrixA[start:end-start][:131])
         Generating Tesla code
         52, #pragma acc loop gang /* blockIdx.x */
         54, #pragma acc loop vector(128) /* threadIdx.x */
         55, #pragma acc loop seq
     54, Loop is parallelizable
     55, Complex loop carried dependence of matrixA->,matrixProduct->,matrixB-> prevents parallelization
         Loop carried dependence of matrixProduct-> prevents parallelization
         Loop carried backward dependence of matrixProduct-> prevents vectorization
     59, Parallel region terminated
% a.out
Num Threads: 4
THID 0 using GPU: 0
THID: 0, Start: 0 End: 31
THID 1 using GPU: 1
THID: 1, Start: 32 End: 63
THID 3 using GPU: 3
THID: 3, Start: 96 End: 129
THID 2 using GPU: 2
THID: 2, Start: 64 End: 95
Result:
0: 723905 141340355
1: 1813955 425843405
2: 2904005 710346455
3: 3994055 994849505
...
126: 138070205 35988724655
127: 139160255 36273227705
128: 140250305 36557730755
129: 141340355 36842233805
Mat Colgrove
  • 5,441
  • 1
  • 10
  • 11
0

I ran into an issue with MPI+OpenACC compilation on the shared system I was restricted to and could not upgrade the compiler. The solution I ended up using, was breaking the work down with OMP first then calling an OpenACC function as follows:

//Main code
pragma omp parallel num_threads(num_gpus)
    {
        #pragma omp for private(tid)
        for (tid = 0; tid < num_gpus; tid++)
        {
            //Get thread openMP number and set the GPU device to that number
            int threadNum = omp_get_thread_num();
            acc_set_device_num(threadNum, acc_device_nvidia);

            // check with thread is using which GPU
            int gpu_num = acc_get_device_num(acc_device_nvidia);
            printf("Thread # %d is going to use GPU # %d \n", threadNum, gpu_num);

            //distribute the uneven rows
            if (threadNum < extraRows)
            {
                startRow = threadNum * (rowsPerThread + 1);
                stopRow = startRow + rowsPerThread;
            }
            else
            {
                startRow = threadNum * rowsPerThread + extraRows;
                stopRow = startRow + (rowsPerThread - 1);
            }
            // Debug to check allocation of data to threads
            //printf("Start row is %d, and Stop rows is %d \n", startRow, stopRow);

            GPUmultiplyMatrix(matrixA, matrixB, matrixProduct, startRow, stopRow);
        }
    }
void GPUmultiplyMatrix(long long int matrixA[SIZE][SIZE], long long int 
matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE], int 
startRow, int stopRow)
    {
        int row;
        int col;
        int key;

        #pragma acc parallel loop collapse (2)
        for (row = startRow; row <= stopRow; row++)
            for (col = 0; col < SIZE; col++)
                for (key = 0; key < SIZE; key++)
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
    }
FastGTR
  • 343
  • 4
  • 4