
I am processing an array in parallel with OpenMP (the "work" part below). If I first initialize the array in parallel, the work part takes 18 ms. If I initialize the array serially, without OpenMP, the work part takes 58 ms. What causes the worse performance?

The system:

  • Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores / 56 threads, 2 Sockets)

Example code:

const unsigned long array_length = 160000000;
unsigned long sum = 0;
long* array = (long*)malloc(sizeof(long) * array_length);

// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++) {
    array[i] = i % 10;
}


// Time start

// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i] < 4)
    {
        sum += array[i];
    }
}

// Time End
Tored
    _"NUMA Architecture with 28 cores"_ Could you be more specific? Which processor? How many sockets / NUMA nodes / memory controllers? Anyway, you may want to read about the _first touch policy_ (e.g., here: https://stackoverflow.com/q/12196553/580083). – Daniel Langr Apr 29 '22 at 08:52
  • Another useful link: https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-vanderPas.pdf. See slides 19 to 22; they nicely describe your issue. – Daniel Langr Apr 29 '22 at 09:01

2 Answers


There are two aspects at work here:

NUMA allocation

In a NUMA system, memory pages can be local to a CPU or remote. By default, Linux allocates memory according to a first-touch policy: the first write access to a memory page determines on which NUMA node the page is physically allocated.

If your malloc is large enough that new memory is requested from the OS (instead of reusing existing heap memory), this first touch happens in the initialization. Because you use static scheduling for OpenMP, the thread that initializes a chunk of the array is the same thread that later works on it. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, the memory it accesses will be local.

If you don't parallelize the initialization, all pages end up local to the NUMA node of the main thread, which is worse for the threads running on the other socket.

Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
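
To see where the pages actually end up on Linux, you can query a few of them with move_pages from libnuma. Below is a minimal sketch of that check (my addition, not part of the original answer); it assumes the libnuma development headers are installed and that you compile with -fopenmp and link with -lnuma.

#include <numaif.h>   // move_pages, from libnuma
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const std::size_t n = 160000000;
    long* array = (long*)std::malloc(sizeof(long) * n);

    // Parallel first touch with a static schedule, as in the question.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; i++)
        array[i] = i % 10;

    // Sample one address every n/8 elements and ask the kernel which NUMA
    // node each page resides on (nodes == nullptr means "query only").
    std::vector<void*> pages;
    for (std::size_t i = 0; i < n; i += n / 8)
        pages.push_back(&array[i]);
    std::vector<int> status(pages.size());

    if (move_pages(0, pages.size(), pages.data(), nullptr, status.data(), 0) == 0)
        for (std::size_t p = 0; p < pages.size(); p++)
            std::printf("sample %zu -> NUMA node %d\n", p, status[p]);

    std::free(array);
}

With the parallel initialization the samples should be spread over both nodes; with a serial initialization they should all report the node the main thread runs on.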

Caching

The same as above also applies to caches. The initialization will put array elements into the cache of the CPU doing it. If the same CPU accesses the memory during the second phase, it will be cache-hot and ready to use.

Homer512

First of all, the explanation by @Homer512 is completely correct.

Now I note that you tagged this question "C++", but you're using malloc for your array. That is bad style in C++: you should use std::vector for simple containers, or std::array for ones that are small enough.

And then you have a big problem, because std::vector uses "value initialization": the whole array is automatically filled with zeroes, and there is no way to have that done in parallel with OpenMP.

Here is a big trick:

template<typename T>
struct uninitialized {
  uninitialized() {}                            // deliberately empty: no zero-fill
  T val;
  constexpr operator T() const { return val; }
  T operator=( const T& v ) { val = v; return val; }
};

std::vector<uninitialized<double>> x(N), y(N);  // allocates without touching the elements

#pragma omp parallel for
for (int i=0; i<N; i++)
  y[i] = x[i] = 0.;                             // first touch happens here, in parallel
x[0] = 0.; x[N-1] = 1.;
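
As a possible follow-up on usage (my addition, not from the answer): because uninitialized<T> wraps exactly one T, the buffer can be handed to code that expects a plain T array once the parallel first touch is done, e.g. via reinterpret_cast. A self-contained sketch, reusing the struct from above and a work loop modeled on the question:

#include <cstdio>
#include <vector>

// uninitialized<T> as defined in the answer above.
template<typename T>
struct uninitialized {
  uninitialized() {}
  T val;
  constexpr operator T() const { return val; }
  T operator=( const T& v ) { val = v; return val; }
};

int main() {
  const long long N = 100000000;              // arbitrary example size
  std::vector<uninitialized<double>> x(N);    // no zero-fill of the elements

  // Parallel first touch, analogous to the question's initialisation loop.
  #pragma omp parallel for schedule(static)
  for (long long i = 0; i < N; i++)
    x[i] = (i % 10) * 1.0;

  // The wrapper holds a single double, so the storage can be viewed as a
  // plain double array by interfaces that expect one.
  const double* data = reinterpret_cast<const double*>(x.data());

  double sum = 0.0;
  #pragma omp parallel for schedule(static) reduction(+: sum)
  for (long long i = 0; i < N; i++)
    if (data[i] < 4.0)
      sum += data[i];

  std::printf("sum = %f\n", sum);
}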
Victor Eijkhout