I have following working code for matrix multiplication (with tiling) which is producing results correctly:
void ProcessTile(int d1stride, int d2stride, int d3stride) {
for (int i=d1stride ; i<(d1stride+BLOCKSIZE) ; i++) //covering rows of tile
{
for (int j=d2stride ; j<(d2stride+BLOCKSIZE) ; j++) //covering columns of tile (or rows if transpose)
{
__m128 vsum = _mm_set1_ps(0.0f);
for (int k=d3stride ; k<(d3stride+BLOCKSIZE); k+=4) //covering convolution of vectors in a tile
{
__m128 vx = _mm_load_ps(&x2[i][k]); //loads up 4 floats into a __m128
__m128 vy = _mm_load_ps(&y2[j][k]); //loads up 4 floats into a __m128
vsum = _mm_add_ps(vsum, _mm_mul_ps(vx, vy));
}
vsum = _mm_hadd_ps(vsum, vsum);
vsum = _mm_hadd_ps(vsum, vsum);
vsum = _mm_add_ps(vsum, _mm_load_ss (&r2[i][j]));
_mm_store_ss(&r2[i][j], vsum);
}
}
}
MatrixMultiply()
{
#pragma omp parallel
{
#pragma omp single
{
for(i=0 ; i<N ; i+=BLOCKSIZE) //covering tiled-rows of entire matrix
{
for(j=0 ; j<N ; j+=BLOCKSIZE) // covering tiled-columns of entire matrix
{
#pragma omp task firstprivate(i,j,k)
for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-multiplication of two matrix-vectors
{
ProcessTile(i,j,k);
}
}
}
}
}
}
However I am looking to do tiling on fine granularity level which is not working at the moment. I'll explain the working/approach with a snapshot attached matrix multiplication with tiling diagram
The purpose of the above code is to generate OpenMP tasks and actual work is happening in ProcessTile(...). Snapshot has 8x8 matrix with BLOCKSIZE of 4 (i.e. 4 tiles in all matrices and each tile is 2x2 size).
The working scenario which is producing expected correct results and performance
C[0][0] = (A[0][0]xB[0][0]) + (A[0][1]xB[1][0]) //This is tiled multiplication i.e. A[0][0] is 2x2 matrix etc.
The scenario which is not working and I am looking to make work is to change innermost task generating loop from
#pragma omp task firstprivate(i,j,k)
for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-multiplication of two matrix-vectors
{
ProcessTile(i,j,k);
}
to
for(k=0 ; k<N ; k+=BLOCKSIZE) //covering tiled-convolution of two matrix-vectors
{
#pragma omp task firstprivate(i,j,k)
ProcessTile(i,j,k);
}
Above change results in the following generation of tasks and tilled multiplication
C[0][0] += (A[0][0]xB[0][0]) //tiled multiplication done by 'task X'
C[0][0] += (A[0][1]xB[1][0]) //tiled multiplication done by 'task Y'
My answers go wrong when I made the above change. The reason I am looking to do this way to have independent memory regions for each task/thread inside each matrix.
Both tasks are doing accumulative operation so not sure why getting wrong answers. What I might be missing? Is there any way I can communicate info to the task generation code which goes along the lines of 'task Y' starts after when 'task X' finishes? I am not sure if such dependency is required for such an accumulative operation or a good way to solve the problem. Maybe there is another way.