
I believe I am experiencing false sharing using OpenMP. Is there any way to identify it and fix it?

My code is: https://github.com/wchan/libNN/blob/master/ResilientBackpropagation.hpp line 36.

On a 4-core CPU, the OpenMP version is only about 10% faster than the single-threaded version. On a NUMA system with 32 physical (64 virtual) CPUs, CPU utilization is stuck at around 1.5 cores. I think this is a direct symptom of false sharing, and the reason the code fails to scale.

I also tried running it under the Intel VTune profiler, which reported that most of the time is spent in the "f()" and "+=" functions. I believe this is reasonable, but it doesn't really explain why I am getting such poor scaling...

Any ideas/suggestions?

Thanks.

  • False sharing doesn't decrease your CPU utilization. It just causes tons of cache misses. – Mysticial Jan 27 '12 at 00:52
  • @Mystical - My understanding was that on NUMA it might if the scheduler was scheduling all threads on the processors which owned the page to avoid migrating it around excessively. – Flexo Jan 27 '12 at 18:54
  • @awoodland That's certainly a possibility - albeit another consequence of having everything adjacent in memory. (I didn't get your ping since you left out the second `i` in my UN.) – Mysticial Jan 27 '12 at 20:59

2 Answers


Use reduction instead of explicitly indexing an array based on the thread ID. That array virtually guarantees false sharing.

i.e. replace this

#pragma omp parallel for
    clones[omp_get_thread_num()]->mse() += norm_2(dedy);

for (int i = 0; i < omp_get_max_threads(); i++) {
     neural_network->mse() += clones[i]->mse();
}

with this:

// mse is a local accumulator that must be declared and set to 0 before the loop
#pragma omp parallel for reduction(+ : mse)
     mse += norm_2(dedy);

neural_network->mse() = mse;
Ben Voigt

One way to know for sure is to look at cache statistics with a tool like cachegrind:

valgrind --tool=cachegrind [command]
Marek Grzenkowicz
  • Yes, this is what I was thinking: use thread profiling tools. +1 – pyCthon Feb 03 '12 at 03:57
  • It does support multiple threads, but valgrind uses its own internal scheduler, so thread execution is serialized. I don't think cachegrind is a good choice here. – janjust Apr 09 '12 at 16:03