I am trying to distribute the work of multiplying two NxN matrices across 3 nVidia GPUs using 3 OpenMP threads. (The matrix values will get large hence the long long data type.) However I am having trouble placing the #pragma acc parallel loop
in the correct place. I have used some examples in the nVidia PDFs shared but to no luck. I know that the inner most loop cannot be parallelized. But I would like each of the three threads to own a GPU and do a portion of the work. Note that input and output matrices are defined as global variables as I kept running out of stack memory.
I have tried the code below, but I get compilation errors all pointing to line 75 which is the #pragma acc parallel loop
line
[test@server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
Function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
int row;
int col;
int key;
#pragma omp for
#pragma acc parallel loop
for (row = 0; row < SIZE; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
}