
I would like to take advantage of OpenMP to make my task parallel.

I need to subtract the same quantity from all the elements of an array and write the result into another vector. Both arrays are dynamically allocated with malloc, and the first one is filled with values from a file. Each element is of type uint64_t.

#pragma omp parallel for
for (uint64_t i = 0; i < size; ++i) {
    new_vec[i] = vec[i] - shift;
}

Here `shift` is the fixed value I want to subtract from every element of `vec`, and `size` is the length of both `vec` and `new_vec`, which is approximately 200k.

I compile the code with `g++ -fopenmp` on Arch Linux. I'm on an Intel Core i7-6700HQ and I use 8 threads. The running time of the OpenMP version is 5 to 6 times higher than that of the serial one, even though I can see that all the cores are working when it runs.

I think this might be caused by a false sharing issue, but I can't find it.

  • You are memory bandwidth limited as the calculation you are trying to make parallel is trivial and is basically just moving data between memory locations. Adding threads will cause memory cache misses/thrashing and pre-fetch failures. The effect of this is the code running slower. Very approx 1.5 threads can saturate the memory bus on a modern PC. – Richard Critten Jul 11 '17 at 11:21
  • @RichardCritten that is not true. High end processors have upper bandwidth limits designed in a way such that you need to use multithreading to saturate them. Check out the link on my answer. – Jorge Bellon Jul 11 '17 at 12:05
  • How do you measure the execution time? – Hristo Iliev Jul 11 '17 at 13:54
  • @HristoIliev I'm using `perf stat -r200` to average the execution time – Matteo Pompili Jul 12 '17 at 07:21
  • Could you produce a [mcve]? What is the absolute execution time of the loop in seconds? Is this the only OpenMP region in the code? How many times does it get executed in a single program run? It might be the OpenMP overhead. – Hristo Iliev Jul 12 '17 at 08:02

1 Answer


You should adjust how the iterations are split among the threads. The `schedule(static, chunk_size)` clause lets you do that.

Try to use `chunk_size` values that are multiples of `64/sizeof(uint64_t)` to avoid the false sharing you suspect:

[ cache line n   ][ cache line n+1 ]
[ chunk 0  ][ chunk 1  ][ chunk 2  ]

And achieve something like this:

[ cache line n   ][ cache line n+1 ][ cache line n+2 ][...]
[ chunk 0                          ][ chunk 1             ]
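
For example, with a 64-byte cache line each chunk should cover a multiple of 64/sizeof(uint64_t) = 8 elements. A minimal sketch of the loop with an explicit schedule (the chunk size of 1024 is just an illustrative choice):

// 64 / sizeof(uint64_t) = 8 elements per cache line,
// so pick a chunk size that is a multiple of 8.
const uint64_t chunk_size = 1024;

#pragma omp parallel for schedule(static, chunk_size)
for (uint64_t i = 0; i < size; ++i) {
    new_vec[i] = vec[i] - shift;
}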

You should also allocate your vectors so that they are aligned to cache line boundaries. That way you ensure that the first chunk, and therefore all subsequent chunks, start on a cache line as well.

#include <stdlib.h>   // aligned_alloc (C11 / glibc)
#include <unistd.h>   // sysconf
#define CACHE_LINE_SIZE sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
uint64_t *vec = (uint64_t *) aligned_alloc(CACHE_LINE_SIZE /*alignment*/, 200000 * sizeof(uint64_t) /*size*/);

Your problem is very similar to what the STREAM Triad benchmark represents. Check out how to optimize that benchmark and you will be able to map those optimizations almost exactly onto your code.
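
Putting the pieces together, a minimal self-contained sketch could look like the one below (assuming a toolchain where `aligned_alloc` is available, i.e. C11 or C++17; the shift value and chunk size are illustrative only, and file reading and error checking are omitted). The added `simd` clause additionally asks the compiler to vectorize each chunk:

#include <stdint.h>
#include <stdlib.h>   // aligned_alloc, free
#include <unistd.h>   // sysconf

int main() {
    const size_t cache_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    const uint64_t size  = 200000;   // length of both arrays
    const uint64_t shift = 42;       // illustrative value only

    // Allocate both arrays on cache-line boundaries.
    uint64_t *vec     = (uint64_t *) aligned_alloc(cache_line, size * sizeof(uint64_t));
    uint64_t *new_vec = (uint64_t *) aligned_alloc(cache_line, size * sizeof(uint64_t));

    // ... fill vec from the file here ...

    // Chunks of 1024 elements are a multiple of the 8 elements per cache line.
    #pragma omp parallel for simd schedule(static, 1024)
    for (uint64_t i = 0; i < size; ++i) {
        new_vec[i] = vec[i] - shift;
    }

    free(vec);
    free(new_vec);
    return 0;
}

Compile with `g++ -O3 -fopenmp` to enable both the OpenMP runtime and the compiler's vectorizer.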

Jorge Bellon
  • Among other things, thorough discussions of Stream include choice of number of threads and their distribution among cores. Multiple threads per core almost certainly will slow it down. Did you maintain simd vectorization when you set omp parallel? – tim18 Jul 11 '17 at 12:18
  • You can request for both multithreading and vectorization using `omp for simd`, which is a combination of the one you are using and `omp simd`. Check the [reference for `omp simd`](https://software.intel.com/en-us/node/524530). Compilers may be able to vectorize it without any hints, but given that you have the possibility to specify it, it is way better to put the clause too. – Jorge Bellon Jul 11 '17 at 12:24
  • The default loop schedule for GCC is `static`, therefore there are exactly 7 cache lines out of ~25000 that might be shared between the threads and up to 8 iterations with false sharing out of ~25000 per thread. The problem is most likely using `clock()` to measure the execution time on Linux and there are countless similar questions here. – Hristo Iliev Jul 11 '17 at 14:39
  • Besides maintaining simd vectorization, it's important to engage nontemporal store in order to get efficient multi-threaded execution. As Hristo said, if you total up the time of all threads you will surely see it increase with number of threads. – tim18 Jul 12 '17 at 11:47