I am trying to optimize some matrix-matrix multiplication benchmark code that uses OpenMP on a MAESTRO processor. The MAESTRO has 49 processors arranged in a 7x7 two-dimensional grid, and each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.stack.imgur.com/RG0fC.png.
My main question is: can different data types (char vs. short vs. int, etc.) directly impact the performance of OpenMP code on NUMA-based processors? If so, is there a way to mitigate it? Below is my explanation of why I am asking.
I was given a set of benchmarks that a research group had used to measure the performance of a given processor. The benchmarks produced performance gains on other processors, but the group did not see the same kind of results when running them on the MAESTRO. Here is a snippet of the matrix multiplication benchmark from the base code I received.
Relevant macros from the header file (the MAESTRO is 64-bit):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <cblas.h>
#include <omp.h>
//set data types
#ifdef ARCH64
//64-bit architectures
#define INT8_TYPE char
#define INT16_TYPE short
#define INT32_TYPE int
#define INT64_TYPE long
#else
//32-bit architectures
#define INT8_TYPE char
#define INT16_TYPE short
#define INT32_TYPE long
#define INT64_TYPE long long
#endif
#define SPFP_TYPE float
#define DPFP_TYPE double
//setup timer
//us resolution
#define TIME_STRUCT struct timeval
#define TIME_GET(time) gettimeofday((time),NULL)
//parenthesized so that TIME_RUNTIME subtracts the full start time,
//not just start.tv_sec
#define TIME_DOUBLE(time) ((time).tv_sec+1E-6*(time).tv_usec)
#define TIME_RUNTIME(start,end) (TIME_DOUBLE(end)-TIME_DOUBLE(start))
//select random seed method
#ifdef FIXED_SEED
//fixed
#define SEED 376134299
#else
//based on system time
#define SEED time(NULL)
#endif
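For reference, the timing macros are used the same way in every benchmark below; a minimal usage sketch (do_work() is just a placeholder):

TIME_STRUCT start,end;
TIME_GET(&start);
do_work();                               //region being measured
TIME_GET(&end);
double seconds=TIME_RUNTIME(start,end);  //elapsed wall-clock time in seconds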
32-bit integer matrix multiplication benchmark:
double matrix_matrix_mult_int32(int size,int threads)
{
    //initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;

    //allocate memory for matrices
    INT32_TYPE *A=malloc(sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=malloc(sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=malloc(sizeof(INT64_TYPE)*(size*size));

    //initialize input matrices to random numbers
    //initialize output matrix to zeros
    for(i=0;i<(size*size);i++)
    {
        A[i]=rand();
        B[i]=rand();
        C[i]=0;
    }

    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation (the i,k,j loop order streams through B row-wise)
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        //parallelize over rows of C with OpenMP
        #pragma omp parallel for num_threads(threads) private(i,j,k)
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }

    //free memory
    free(C);
    free(B);
    free(A);

    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
Running the above benchmark serially resulted in better performance than running it with OpenMP, and I was tasked with optimizing the benchmark for the MAESTRO.
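From what I have read about NUMA systems, the usual first suspect is data placement: memory tends to be homed wherever it is first touched, and the serial initialization loop above touches every element from a single core. The standard remedy is to initialize in parallel with the same static schedule the compute loop will use, so each thread first touches the rows it later works on. I am not sure how directly this carries over to the MAESTRO's cache-homing model, so treat this as a minimal sketch of the idea (it reuses the same A, B, C, and threads as above; rand() is kept for brevity even though it is not thread-safe):

//first-touch initialization: schedule(static) gives both loops the same
//mapping of rows to threads, so each thread first touches the rows it
//will later compute
#pragma omp parallel for num_threads(threads) schedule(static) private(i,j)
for(i=0;i<size;i++)
{
    for(j=0;j<size;j++)
    {
        A[i*size+j]=rand();  //sketch only: rand() is not thread-safe
        B[i*size+j]=rand();
        C[i*size+j]=0;
    }
}
//compute with the matching row-to-thread mapping
#pragma omp parallel for num_threads(threads) schedule(static) private(i,j,k)
for(i=0;i<size;i++)
    for(k=0;k<size;k++)
        for(j=0;j<size;j++)
            C[i*size+j]+=A[i*size+k]*B[k*size+j];

What actually gave me a performance increase, though, was the following code: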
double matrix_matrix_mult_int32(int size,int threads)
{
    //initialize index variables, random seed, and timer
    int i,j,k;
    unsigned int rngseed=SEED;
    TIME_STRUCT start,end;

    //allocate memory for matrices with explicit cache-homing attributes
    //(the alloc_* and tmc_mem_* calls come from the Tilera MDE headers
    //tmc/alloc.h and tmc/mem.h)
    alloc_attr_t attrA=ALLOC_INIT;
    alloc_attr_t attrB=ALLOC_INIT;
    alloc_attr_t attrC=ALLOC_INIT;
    alloc_set_home(&attrA,ALLOC_HOME_INCOHERENT);  //inputs: no coherence traffic
    alloc_set_home(&attrB,ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC,ALLOC_HOME_TASK);        //output: homed on the touching task
    INT32_TYPE *A=alloc_map(&attrA,sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=alloc_map(&attrB,sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=alloc_map(&attrC,sizeof(INT64_TYPE)*(size*size));

    //initialize in parallel; rand() is not thread-safe, so each thread
    //uses rand_r() with its own seed
    #pragma omp parallel num_threads(threads) firstprivate(rngseed)
    {
        rngseed+=omp_get_thread_num();
        #pragma omp for private(i)
        for(i=0;i<(size*size);i++)
        {
            A[i]=rand_r(&rngseed);
            B[i]=rand_r(&rngseed);
            C[i]=0;
            //flush the freshly written values out of this core's cache and
            //invalidate the local copies, since A and B are homed incoherently
            tmc_mem_flush(&A[i],sizeof(A[i]));
            tmc_mem_flush(&B[i],sizeof(B[i]));
            tmc_mem_inv(&A[i],sizeof(A[i]));
            tmc_mem_inv(&B[i],sizeof(B[i]));
        }
    }

    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }

    //unmap memory
    alloc_unmap(C,sizeof(INT64_TYPE)*(size*size));
    alloc_unmap(B,sizeof(INT32_TYPE)*(size*size));
    alloc_unmap(A,sizeof(INT32_TYPE)*(size*size));

    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
Making the caching of the two input arrays incoherent and using OpenMP with dynamic scheduling got the parallelized performance to surpass the serial performance. This is my first experience with a NUMA-architecture processor, so my 'optimizations' are light; I am still learning.
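One knob I have not tuned yet is the dynamic schedule's chunk size: by default each thread grabs a single row of C at a time. A chunked variant I have been meaning to compare (the chunk of 4 rows is an arbitrary, untested placeholder):

//hand out rows in blocks of 4 (placeholder value) to reduce scheduling
//overhead and keep rows that can share a cache line on the same core
#pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic,4)
for(i=0;i<size;i++)
    for(j=0;j<size;j++)
        for(k=0;k<size;k++)
            C[i*size+j]+=A[i*size+k]*B[k*size+j];

Anyway, I tried the same optimizations with the 8-bit integer version of the above code, under all of the same conditions (number of threads and array sizes):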
double matrix_matrix_mult_int8(int size,int threads)
{
    //initialize index variables, random seed, and timer
    int i,j,k;
    unsigned int rngseed=SEED;
    TIME_STRUCT start,end;

    //allocate memory for matrices with the same homing attributes as before
    alloc_attr_t attrA=ALLOC_INIT;
    alloc_attr_t attrB=ALLOC_INIT;
    alloc_attr_t attrC=ALLOC_INIT;
    alloc_set_home(&attrA,ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrB,ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC,ALLOC_HOME_TASK);
    INT8_TYPE *A=alloc_map(&attrA,sizeof(INT8_TYPE)*(size*size));
    INT8_TYPE *B=alloc_map(&attrB,sizeof(INT8_TYPE)*(size*size));
    INT16_TYPE *C=alloc_map(&attrC,sizeof(INT16_TYPE)*(size*size));

    //initialize in parallel; rand() is not thread-safe, so each thread
    //uses rand_r() with its own seed (values are truncated to 8 bits)
    #pragma omp parallel num_threads(threads) firstprivate(rngseed)
    {
        rngseed+=omp_get_thread_num();
        #pragma omp for private(i)
        for(i=0;i<(size*size);i++)
        {
            A[i]=rand_r(&rngseed);
            B[i]=rand_r(&rngseed);
            C[i]=0;
            //flush and invalidate the incoherently homed inputs,
            //as in the 32-bit version
            tmc_mem_flush(&A[i],sizeof(A[i]));
            tmc_mem_flush(&B[i],sizeof(B[i]));
            tmc_mem_inv(&A[i],sizeof(A[i]));
            tmc_mem_inv(&B[i],sizeof(B[i]));
        }
    }

    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }

    //unmap memory
    alloc_unmap(C,sizeof(INT16_TYPE)*(size*size));
    alloc_unmap(B,sizeof(INT8_TYPE)*(size*size));
    alloc_unmap(A,sizeof(INT8_TYPE)*(size*size));

    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
However, the 8-bit OpenMP version was slower than the 32-bit OpenMP version. Shouldn't the 8-bit version execute faster than the 32-bit one? What could be the cause of this discrepancy, and what might alleviate it? Could it be related to the data type of the arrays I'm using, or to something else?
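One idea I have not benchmarked yet, in case it is relevant: C's integer promotions mean the char and short operands are widened to int for the arithmetic anyway, so the 8-bit version does the same work per multiply while packing roughly four times as many C elements (16-bit vs. 64-bit) into each cache line; I suspect the read-modify-write of C[i*size+j] in the inner loop therefore hits lines shared between the dynamically scheduled rows far more often. Accumulating into a local 32-bit variable and storing each output element once would avoid that (and also sidesteps intermediate overflow of the 16-bit type). A minimal sketch of the parallel branch:

#pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
for(i=0;i<size;i++)
{
    for(j=0;j<size;j++)
    {
        INT32_TYPE sum=0;  //local accumulator stays in a register
        for(k=0;k<size;k++)
        {
            sum+=A[i*size+k]*B[k*size+j];
        }
        C[i*size+j]=(INT16_TYPE)sum;  //single store to the shared 16-bit array
    }
}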