OpenMP Matrix Multiplication Issues

Question

I am trying to multiple the values of a matrix.

#include <stdio.h>
#include <omp.h>
#include <time.h>
#include <stdlib.h>
#include <omp.h>

#define N 2048
#define FactorIntToDouble 1.1; 
#define THREAD_NUM 4

double firstMatrix [N] [N] = {0.0};
double secondMatrix [N] [N] = {0.0};
double matrixMultiResult [N] [N] = {0.0};

// Sync
void matrixMulti() {
    for(int row = 0 ; row < N ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }

            matrixMultiResult [row] [col] = resultValue;
        
        }
    }
}


void matrixInit() {
    for(int row = 0 ; row < N ; row++ ) {
        for(int col = 0 ; col < N ;col++){
            srand(row+col);
            firstMatrix [row] [col] = ( rand() % 10 ) * FactorIntToDouble;
            secondMatrix [row] [col] = ( rand() % 10 ) * FactorIntToDouble;
        }
    }
    
}

// Parallel
void matrixMulti2(int start, int end) {
    printf("Op: %d - %d\n", start, end);
    for(int row = start ; row < end ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }

            matrixMultiResult [row] [col] = resultValue;
        }
    }
}


void process1(){
    clock_t t1 = clock(); 
    #pragma omp parallel
    {
    int thread = omp_get_thread_num();
    int thread_multi = N / 4;
    
    int start = (thread) * thread_multi;
        
        int end = 0;
        if(thread == (THREAD_NUM - 1)){
            end = (start + thread_multi);
        }else{
            end = (start + thread_multi) - 1;
        }
        
        matrixMulti2(start, end);
        
    
    }
    
    clock_t t2 = clock(); 
    printf("time 2: %ld\n", t2-t1);
}


int main(){
    matrixInit();
    
    clock_t t1 = clock(); 
    matrixMulti();
    
    clock_t t2 = clock(); 
    printf("time: %ld", t2-t1);
    
    process1();

    return 0;
}

I have both a parallel and sync version. But the parallel version is longer than the sync version.

Current the sync takes around 90 seconds and the parallel over 100. Which makes no sense to me.

My logic was to split the matrix into 4 parts from the first 4 statement. Which I believe is logical.

After I finish this part. I would like to figure out how to speed up this process for the parallel even more. Possibly using Strassen's Matrix Multiplication. I just don't know where to start or how to get to this point.

I've already spent around 5 hours trying to figure this out.

Why don't you use `#pragma omp parallel for` to let OpenMP distribute the loops for `row` and `col`? — Michael Klemm, Jul 21 '21 at 16:50
@MichaelKlemm I am extremely new to OpenMP. How would this work? — Bass Approved, Jul 21 '21 at 16:52
"I've already spent around 5 hours trying to figure this out." In that time you could have read up on BLAS Libraries and used one to give you a fast, well tested solution... https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms — Jim Cownie, Jul 22 '21 at 07:50

Laci · Accepted Answer · 2021-07-22T05:00:22.217

Here it is:

// Sync
void matrixMulti() {
    #pragma omp parallel for collapse(2)
    for(int row = 0 ; row < N ; row++){
        for(int col = 0; col < N ; col++){
            double resultValue = 0;
            for(int transNumber = 0 ; transNumber < N ; transNumber++) {
                resultValue += firstMatrix [row] [transNumber] * secondMatrix [transNumber] [col] ;
            }

            matrixMultiResult [row] [col] = resultValue;
        
        }
    }
}

Update: Here is what I got on an 8 core system using gcc 10.3 -O3 -fopenmp flags (I show you the program's output and result of linux time command) :

main() was changed to measure the time with omp_get_wtime() because in linux clock() measures processor time:

double t1 = omp_get_wtime(); 
matrixMulti();    
double t2 = omp_get_wtime(); 
printf("time: %f", t2-t1);

Serial program:

time: 25.895234

real    0m33.296s
user    0m33.139s
sys     0m0.152s

using: #pragma omp parallel for

 time: 3.573521

 real    0m11.120s
 user    0m32.205s
 sys     0m0.136s

using: #pragma omp parallel for collapse(2)

time: 5.466674

real    0m12.786s
user    0m49.978s
sys     0m0.248s

The results suggest that initialization of matrix takes ca. 8 s, so it may also be worth parallelizing. Without collapse(2) the program runs faster, so do not use collapse(2) clause. Note that on your system you may got different speed improvement or even decrease depending on your hardware. Speed of matrix multiplication strongly depends on the speed of memory read/write. Shared-Memory Multicore systems (i.e most PCs, laptops) may not show any speed increase upon parallelization of this program, but Distributed-Memory Multicore systems (i.e. high-end serves) definitely show performance increase. For more details please read e.g. this.

Update2: On Ryzen 7 5800X I got 41.6 s vs 1.68 s, which is a bigger increase than the number of cores. It is because more cache memory is available when all the cores is used.

I think the `clock()` measures processor time, not wall-clock time, use `omp_get_wtime()` — Laci, Jul 21 '21 at 17:48

OpenMP Matrix Multiplication Issues

1 Answers1