I am very new to OpenCL and am going through the Altera OpenCL examples. In their matrix multiplication example, they have used the concept of blocks, where dimensions of the input matrices are multiple of block size. Here's the code:
void matrixMult( // Input and output matrices
__global float *restrict C,
__global float *A,
__global float *B,
// Widths of matrices.
int A_width, int B_width)
{
// Local storage for a block of input matrices A and B
__local float A_local[BLOCK_SIZE][BLOCK_SIZE];
__local float B_local[BLOCK_SIZE][BLOCK_SIZE];
// Block index
int block_x = get_group_id(0);
int block_y = get_group_id(1);
// Local ID index (offset within a block)
int local_x = get_local_id(0);
int local_y = get_local_id(1);
// Compute loop bounds
int a_start = A_width * BLOCK_SIZE * block_y;
int a_end = a_start + A_width - 1;
int b_start = BLOCK_SIZE * block_x;
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
{
running_sum += A_local[local_y][k] * B_local[local_x][k];
}
}
// Store result in matrix C
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}
Assume block size is 2, then: block_x
and block_y
are both 0; and local_x
and local_y
are both 0.
Then A_local[0][0]
would be A[0]
and B_local[0][0]
would be B[0]
.
Sizes of A_local
and B_local
are 4 elements each.
In that case, how would A_local
and B_local
access other elements of the block in that iteration?
Also would separate threads/cores be assigned for each local_x
and local_y
?