
I am in the process of trying to optimize some matrix-matrix multiplication benchmark code that uses OpenMP on a MAESTRO processor. The MAESTRO has 49 processors arranged in a two-dimensional 7x7 grid. Each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.stack.imgur.com/RG0fC.png.

My main question is: Can different data types (char vs short vs int, etc.) directly impact the performance of OpenMP code on NUMA-based processors? If so, is there a way to alleviate it? Below is my explanation of why I am asking this.

I was given a set of benchmarks that had been used by a research group to measure the performance of a given processor. The benchmarks showed performance gains on other processors, but the group did not see the same kind of results when running them on the MAESTRO. Here is a snippet of the matrix multiplication benchmark from the base code I received:

Relevant macros from header file (MAESTRO is 64-bit):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <cblas.h>
#include <omp.h>

//set data types
#ifdef ARCH64
    //64-bit architectures
    #define INT8_TYPE char
    #define INT16_TYPE short
    #define INT32_TYPE int
    #define INT64_TYPE long
#else
    //32-bit architectures
    #define INT8_TYPE char
    #define INT16_TYPE short
    #define INT32_TYPE long
    #define INT64_TYPE long long
#endif
#define SPFP_TYPE float
#define DPFP_TYPE double

//setup timer

//us resolution
#define TIME_STRUCT struct timeval
#define TIME_GET(time) gettimeofday((time),NULL)
#define TIME_DOUBLE(time) ((time).tv_sec+1E-6*(time).tv_usec)
#define TIME_RUNTIME(start,end) (TIME_DOUBLE(end)-TIME_DOUBLE(start))

//select random seed method
#ifdef FIXED_SEED
    //fixed
    #define SEED 376134299
#else
    //based on system time
    #define SEED time(NULL)
#endif
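
For reference, here is a minimal sketch of how the timing macros above are meant to be used (the printf call and the placeholder work region are illustrative, not part of the original header):

//example use of the timing macros (illustrative only)
TIME_STRUCT start,end;
TIME_GET(&start);
//... region to be timed ...
TIME_GET(&end);
printf("runtime: %f seconds\n",TIME_RUNTIME(start,end));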

32-bit integer matrix multiplication benchmark:

double matrix_matrix_mult_int32(int size,int threads)
{


//initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;

    //allocate memory for matrices
    INT32_TYPE *A=malloc(sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=malloc(sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=malloc(sizeof(INT64_TYPE)*(size*size));

    //initialize input matrices to random numbers
    //initialize output matrix to zeros
    for(i=0;i<(size*size);i++)
    {
        A[i]=rand();
        B[i]=rand();
        C[i]=0;
    }

//serial operation
if(threads==1)
{
    //start timer
    TIME_GET(&start);
    //computation
    for(i=0;i<size;i++)
    {
        for(k=0;k<size;k++)
        {
            for(j=0;j<size;j++)
            {
                C[i*size+j]+=A[i*size+k]*B[k*size+j];
            }
        }
    }
    //end timer
    TIME_GET(&end);
}
//parallel operation
else
{
    //start timer
    TIME_GET(&start);
    //parallelize with OpenMP
    #pragma omp parallel for num_threads(threads) private(i,j,k)
    for(i=0;i<size;i++)
    {
        for(k=0;k<size;k++)
        {
            for(j=0;j<size;j++)
            {
                C[i*size+j]+=A[i*size+k]*B[k*size+j];
            }
        }
    }
    //end timer
    TIME_GET(&end);
}

    //free memory
    free(C);
    free(B);
    free(A);

    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
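
For context, here is a minimal, hypothetical driver for this benchmark (the matrix size of 450 is the value I have been testing with; the thread count of 49 and the printf output are illustrative, and the real harness is not shown here):

#include <stdio.h>

double matrix_matrix_mult_int32(int size,int threads);

int main(void)
{
    //hypothetical driver: compare serial vs. parallel runtimes
    //(450 and 49 are example values, not fixed benchmark parameters)
    double serial=matrix_matrix_mult_int32(450,1);
    double parallel=matrix_matrix_mult_int32(450,49);
    printf("serial:   %f s\n",serial);
    printf("parallel: %f s\n",parallel);
    return 0;
}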

Running the above benchmark serially resulted in better performance than running it with OpenMP. I was tasked with optimizing the benchmark for the MAESTRO. Using the following code, I was able to get a performance increase:

double matrix_matrix_mult_int32(int size,int threads)
{

//initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;


    //allocate memory for matrices
    alloc_attr_t attrA = ALLOC_INIT;
    alloc_attr_t attrB = ALLOC_INIT;
    alloc_attr_t attrC = ALLOC_INIT;

    alloc_set_home(&attrA, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrB, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC, ALLOC_HOME_TASK);

    INT32_TYPE *A=alloc_map(&attrA, sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=alloc_map(&attrB, sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=alloc_map(&attrC, sizeof(INT64_TYPE)*(size*size));

    #pragma omp parallel for num_threads(threads) private(i)
    for(i=0;i<(size*size);i++)
    {

        A[i] = rand();
        B[i] = rand();
        C[i] = 0;
        tmc_mem_flush(&A[i], sizeof(A[i]));
        tmc_mem_flush(&B[i], sizeof(B[i]));
        tmc_mem_inv(&A[i], sizeof(A[i]));
        tmc_mem_inv(&B[i], sizeof(B[i]));
    }


    //serial operation
    if(threads==1)
    {
        //start timer 
        TIME_GET(&start);

        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {   
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }

        TIME_GET(&end);

    }
    else
    {
        TIME_GET(&start);

        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }

        TIME_GET(&end);
    }


    alloc_unmap(C, sizeof(INT64_TYPE)*(size*size));
    alloc_unmap(B, sizeof(INT32_TYPE)*(size*size));
    alloc_unmap(A, sizeof(INT32_TYPE)*(size*size));


    //compute and return runtime
    return TIME_RUNTIME(start,end);
}

Marking the two input arrays as incoherently cached and using OpenMP with dynamic scheduling helped the parallel version surpass the serial performance. This is my first experience with a processor with a NUMA architecture, so my 'optimizations' are light since I am still learning. Anyway, I tried applying the same optimizations to the 8-bit integer version of the above code under all of the same conditions (number of threads and array sizes):

double matrix_matrix_mult_int8(int size,int threads)
{

//initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;


    //allocate memory for matrices
    alloc_attr_t attrA = ALLOC_INIT;
    alloc_attr_t attrB = ALLOC_INIT;
    alloc_attr_t attrC = ALLOC_INIT;

    alloc_set_home(&attrA, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrB, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC, ALLOC_HOME_TASK);

    INT8_TYPE *A=alloc_map(&attrA, sizeof(INT8_TYPE)*(size*size));
    INT8_TYPE *B=alloc_map(&attrB, sizeof(INT8_TYPE)*(size*size));
    INT16_TYPE *C=alloc_map(&attrC, sizeof(INT16_TYPE)*(size*size));

    #pragma omp parallel for num_threads(threads) private(i)
    for(i=0;i<(size*size);i++)
    {

        A[i] = rand();
        B[i] = rand();
        C[i] = 0;
        tmc_mem_flush(&A[i], sizeof(A[i]));
        tmc_mem_flush(&B[i], sizeof(B[i]));
        tmc_mem_inv(&A[i], sizeof(A[i]));
        tmc_mem_inv(&B[i], sizeof(B[i]));
    }


    //serial operation
    if(threads==1)
    {
        //start timer 
        TIME_GET(&start);

        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {   
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }

        TIME_GET(&end);

    }
    else
    {
        TIME_GET(&start);

        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }

        TIME_GET(&end);
    }


    alloc_unmap(C, sizeof(INT16_TYPE)*(size*size));
    alloc_unmap(B, sizeof(INT8_TYPE)*(size*size));
    alloc_unmap(A, sizeof(INT8_TYPE)*(size*size));


    //compute and return runtime
    return TIME_RUNTIME(start,end);
}

However, the 8-bit OpenMP version ran slower than the 32-bit OpenMP version. Shouldn't the 8-bit version execute faster than the 32-bit version? What could be the cause of this discrepancy, and what are some possible ways to alleviate it? Could it be related to the data types of the arrays I'm using, or something else?

Shaun Holtzman
  • It's misleading to say this chip is 7X7 NUMA because 7x7 NUMA means 7 nodes per cluster and 7 clusters in total. This chip obviously has only 4 external controllers. – user3528438 Dec 13 '16 at 21:06
  • The cores in this chip actually have a pretty large L2 cache, so if your data set is not large enough, using a smaller data type will waste a lot of time on conversions to and from full-width types. If squeezing the data does not improve your performance, then there's no need to do it. – user3528438 Dec 13 '16 at 21:14
  • What is the size of the matrix? BTW, this may help you: http://lemire.me/blog/2013/09/13/are-8-bit-or-16-bit-counters-faster-than-32-bit-counters/ – dreamcrash Dec 13 '16 at 22:01
  • Why don't you use e.g. `int32_t` instead of defining your own types like `INT32_TYPE`? That's the main reason these were defined in C99. – Z boson Dec 14 '16 at 10:58
  • What compiler optimization did you use? What compiler options did you use? What compiler? – Z boson Dec 14 '16 at 10:59
  • The size I have been using to pass through 'int size' for both the 32-bit and 8-bit functions has been 450. I get ~0.77 seconds for the 32-bit function and ~1.5 seconds for the 8-bit function. I will try using int32_t instead of INT32_TYPE to see if it makes a difference and report back. The compiler I am using is tile-gcc `:/usr/local/MAESTRO-GCC-MDE-0.9/bin/tile-gcc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=ab4a1e99aaa0a9b4ec7336641f21613fb6b5a73c, not stripped` – Shaun Holtzman Dec 14 '16 at 14:24
  • `/usr/local/MAESTRO-GCC-MDE-0.9/bin/tile-gcc --version tile-gcc (GCC) 4.7.3` Compiler flags I used: `-ltmc -lpthread -fopenmp -lm -O3 -Ofast -march=maestro` Also, and I don't know if I am off base here, but when I compile my benchmarks and do a 'file' on the executable it reports `ELF 32-bit LSB`. Knowing that the MAESTRO is 64-bit, is this a problem? The compiler says it only supports compiling as 32-bit. – Shaun Holtzman Dec 14 '16 at 14:30
  • @Zboson Using the C99 types didn't seem to make a difference. However, I made some changes to the for loop and got new results. I changed the inner loop to: `...for(k=0,result=0;k – Shaun Holtzman Dec 14 '16 at 16:24
  • My point with `int32_t` was only about style not results. I did not read your question in detail. – Z boson Dec 15 '16 at 08:37

1 Answer


Two things come to mind.

The first is your 8-bit (one-byte) data type versus a 32-bit (four-byte) data type, combined with the compiler aligning data structures to N-byte boundaries. I think it's typically 4-byte boundaries, especially when the compiler defaults to 32-bit. There is a compiler option to force the alignment boundary.

Why does compiler align N byte data types on N byte boundaries?

There might be extra operations needed to handle a one-byte data type, where the other three bytes have to be masked off in order to get the correct value, versus no masking operations with a standard 32-bit (or 64-bit) data type.
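
As a rough sketch of the language-level side of this (C's integer promotions, not anything specific to the MAESTRO instruction set), the narrow operands are widened to int before the multiply and add, and the sum is truncated again on the store, so an 8-bit element type mostly saves memory traffic rather than arithmetic work:

#include <stdint.h>

//one multiply-accumulate step for each element width (illustrative types)
void mac_example(const int8_t *a8, const int8_t *b8, int16_t *c16,
                 const int32_t *a32, const int32_t *b32, int64_t *c64)
{
    //a8[0] and b8[0] are promoted to int before the multiply (C integer
    //promotions); the result is truncated back to 16 bits when stored,
    //which may cost extra sign-extension/truncation instructions
    c16[0] += a8[0] * b8[0];

    //32-bit operands multiply at (or below) native register width; on a
    //64-bit core there is typically little or no extra widening work
    c64[0] += (int64_t)a32[0] * b32[0];
}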

The other is processor and memory affinity: whether the parallel OpenMP code running on a given core is fetching or writing data from memory that is not directly attached to that core. Whatever hub(s) the access has to traverse to reach distant memory will obviously increase the run time. I'm not sure whether this applies to your MAESTRO system, which I'm unfamiliar with, but what I'm describing happens on late-model Intel 4-CPU systems connected via Intel QuickPath Interconnect (QPI). For example, if you are running on core 0 of CPU 0, fetching from the DRAM modules closest to that CPU will be fastest, versus accessing DRAM over QPI attached to core N on CPU 3, versus going through some hub or InfiniBand to reach DRAM on another blade or node, and so on. I know affinity can be handled with MPI, and I believe it can be handled with OpenMP too, though perhaps not as well. You might try researching "OpenMP CPU memory affinity".
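
For instance, here is a minimal sketch of inspecting thread placement under a GCC/libgomp toolchain on Linux. Note that sched_getcpu() is glibc-specific and the environment variables mentioned in the comments may or may not be supported by the MAESTRO runtime, so treat this as an assumption to verify:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    //report which CPU each OpenMP thread is currently running on;
    //launch with e.g. OMP_PROC_BIND=true or GOMP_CPU_AFFINITY="0-48"
    //to keep threads pinned instead of migrating between cores
    #pragma omp parallel
    {
        printf("thread %d on cpu %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}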

ron