0

I have following working code for matrix multiplication (with tiling) which is producing results correctly:

void ProcessTile(int d1stride, int d2stride, int d3stride) {

        for (int i=d1stride ; i<(d1stride+BLOCKSIZE) ; i++) //covering rows of tile
        {
            for (int j=d2stride ; j<(d2stride+BLOCKSIZE) ; j++) //covering columns of tile (or rows if transpose)
            {
                __m128 vsum = _mm_set1_ps(0.0f);
                for (int k=d3stride ; k<(d3stride+BLOCKSIZE); k+=4) //covering convolution of vectors in a tile
                {
                    __m128 vx = _mm_load_ps(&x2[i][k]);  //loads up 4 floats into a __m128
                    __m128 vy = _mm_load_ps(&y2[j][k]);  //loads up 4 floats into a __m128
                    vsum = _mm_add_ps(vsum, _mm_mul_ps(vx, vy));
                }
                vsum = _mm_hadd_ps(vsum, vsum);
                vsum = _mm_hadd_ps(vsum, vsum);
                vsum = _mm_add_ps(vsum, _mm_load_ss (&r2[i][j]));
                _mm_store_ss(&r2[i][j], vsum);
            }
        }
}

MatrixMultiply()
{    
#pragma omp parallel
    {
        #pragma omp single
        {
            for(i=0 ; i<N ; i+=BLOCKSIZE) //covering tiled-rows of entire matrix
            {
                for(j=0 ; j<N ; j+=BLOCKSIZE) // covering tiled-columns of entire matrix
                {
                    #pragma omp task firstprivate(i,j,k)
                    for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-multiplication of two matrix-vectors
                    {
                        ProcessTile(i,j,k);
                    }
                }
            }
        }
    }

}

However I am looking to do tiling on fine granularity level which is not working at the moment. I'll explain the working/approach with a snapshot attached matrix multiplication with tiling diagram

The purpose of the above code is to generate OpenMP tasks and actual work is happening in ProcessTile(...). Snapshot has 8x8 matrix with BLOCKSIZE of 4 (i.e. 4 tiles in all matrices and each tile is 2x2 size).

The working scenario which is producing expected correct results and performance

C[0][0] = (A[0][0]xB[0][0]) + (A[0][1]xB[1][0]) //This is tiled multiplication i.e. A[0][0] is 2x2 matrix etc.

The scenario which is not working and I am looking to make work is to change innermost task generating loop from

#pragma omp task firstprivate(i,j,k)
for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-multiplication of two matrix-vectors
{
    ProcessTile(i,j,k);
}

to

for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-convolution of two matrix-vectors
{
   #pragma omp task firstprivate(i,j,k)
   ProcessTile(i,j,k);
} 

Above change results in the following generation of tasks and tilled multiplication

C[0][0] += (A[0][0]xB[0][0]) //tiled multiplication done by 'task X'

C[0][0] += (A[0][1]xB[1][0]) //tiled multiplication done by 'task Y'

My answers go wrong when I made the above change. The reason I am looking to do this way to have independent memory regions for each task/thread inside each matrix.

Both tasks are doing accumulative operation so not sure why getting wrong answers. What I might be missing? Is there any way I can communicate info to the task generation code which goes along the lines of 'task Y' starts after when 'task X' finishes? I am not sure if such dependency is required for such an accumulative operation or a good way to solve the problem. Maybe there is another way.

User852
  • 35
  • 7
  • 1
    Because now you have multiple tasks sharing the same value of `i` and `j` and you need to protect from concurrent updates to the corresponding element of `C`. That's my best guess without knowing what's exactly the code behind `ProcessTile()`. – Hristo Iliev May 23 '20 at 10:00
  • Note that an easy way to protect your code is to use *task dependencies* on `C` memory blocks. However, this is probably not the most efficient way to do it. – Jérôme Richard May 23 '20 at 10:29
  • I have added ProcessTile function as well. Q1) What is the best way to make task dependent: #pragma omp task firstprivate(i,j,k) depend(in:i) depend(in:j) just above the ProcessTile. Q2: What are other options from the point of view of efficiency? – User852 May 24 '20 at 07:08
  • Q1) `depend(in:i) depend(in:j)` does not do what you want and will probably sentimentalize all the tasks! You should rather put an array like `depend(inout:res[i*size+j]) depend(in:a[i*size+k], b[k*size+j])` where i, j and k are block offsets. Q2) I advise you to read [this tutorial](https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0) before trying to parallelize efficiently this code. – Jérôme Richard May 27 '20 at 17:37

0 Answers0