
I have a multithreaded application: its main operation is the same block of code executed on independent pieces of a data structure.

Think of it as a tree where each node executes an operation independently of the others, so I create a thread for each node's operation.
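A minimal sketch of this setup, assuming C++ with std::thread (the Node type and process() operation are hypothetical stand-ins for the real code):

    #include <thread>
    #include <vector>

    // Hypothetical node type; process() stands in for the per-node operation.
    struct Node {
        int data = 0;
        void process() { data *= 2; }  // works only on this node's data
    };

    int main() {
        std::vector<Node> nodes(64);
        std::vector<std::thread> threads;
        threads.reserve(nodes.size());
        for (Node& n : nodes)
            threads.emplace_back([&n] { n.process(); });  // one thread per node
        for (auto& t : threads)
            t.join();
    }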

I tested the performance of this code on two machines; the graph of execution time vs. number of threads is shown below.

My question is: given the same code, why does such a difference happen? (Why does execution time saturate faster on one machine than on the other?)

Also, why does running the same code on a 48-core machine give even worse results?

[Graph: execution time vs. number of threads for the two machines]

Red line machine specs:

    CPU(s):              16
    On-line CPU(s) list: 0-15
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           4
    NUMA node(s):        2

Blue line machine specs:

    CPU(s):              8
    On-line CPU(s) list: 0-7
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           2
    NUMA node(s):        1

Both machines have the same core clock speed and the same cache sizes.

Update, confirming the answer below: I tried

    numactl --cpunodebind=0 --membind=0 {exe}

to run on a single NUMA node (--cpunodebind=0 restricts execution to the CPUs of node 0, --membind=0 allocates memory only from node 0), and the results are consistent. It was a NUMA issue.

becks
  • It would really help to know exactly what CPUs the two machines have. The specs you've given us are very vague. It could be memory bandwidth limited for all we know. – David Schwartz Jan 06 '19 at 09:36
  • Both are Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. – becks Jan 06 '19 at 14:04

1 Answer


The machines are very different. One is NUMA, the other is not. Threads running on different NUMA nodes have vastly increased synchronization costs. Even the way memory is allocated matters a lot for performance.
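As an illustration of the allocation point, libnuma on Linux lets a program place memory on a specific node explicitly. A minimal sketch, not from the original post (requires <numa.h> and linking with -lnuma):

    #include <numa.h>    // libnuma; link with -lnuma
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "libnuma not available on this system\n");
            return 1;
        }
        // Back a 1 MiB buffer with pages on node 0, so threads running on
        // node 0 access it locally rather than across the interconnect.
        std::size_t size = 1 << 20;
        void* buf = numa_alloc_onnode(size, 0);
        if (buf != nullptr) {
            std::memset(buf, 0, size);  // touch the pages to commit them
            numa_free(buf, size);
        }
        return 0;
    }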

Writing parallel code which scales well to large NUMA machines can be very hard. It's important to avoid unnecessary synchronization between threads and to allocate memory on the NUMA node where it is primarily used. It is also very costly if one cache line is frequently written by one or more threads and read from a different NUMA node. (This is what makes synchronization with regular concurrency primitives such as mutexes or read-write locks so expensive on NUMA machines.) Spinlocks can have very poor performance as well.
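To make the cache-line point concrete, here is a sketch of padding per-thread counters so that each one occupies its own cache line (the 64-byte line size is an assumption; C++17's std::hardware_destructive_interference_size can be used where available):

    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    constexpr std::size_t kLine = 64;  // assumed cache-line size

    // Without padding, adjacent counters share one cache line, and writes
    // from different threads (possibly on different NUMA nodes) keep
    // invalidating each other's copy of that line ("false sharing").
    struct alignas(kLine) PaddedCounter {
        std::uint64_t value = 0;
        char pad[kLine - sizeof(std::uint64_t)];
    };

    int main() {
        std::vector<PaddedCounter> counters(8);
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < counters.size(); ++i)
            threads.emplace_back([&counters, i] {
                for (int n = 0; n < 1000000; ++n)
                    ++counters[i].value;  // each thread writes only its own line
            });
        for (auto& t : threads)
            t.join();
    }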

As a stop-gap measure, you might get better performance in the NUMA case if you pin the process to cores which are located on the same NUMA node.
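Besides running the whole process under numactl as above, threads can also be pinned programmatically. A Linux-specific sketch using pthread_setaffinity_np (the CPU ids listed for node 0 are an assumption; take the real list from numactl --hardware or lscpu):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Pin the calling thread to the given CPUs (Linux-specific).
    void pin_to_cpus(const std::vector<int>& cpus) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu : cpus)
            CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        // Assumed CPU ids for NUMA node 0; check the actual topology.
        const std::vector<int> node0_cpus = {0, 1, 2, 3, 4, 5, 6, 7};
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i)
            threads.emplace_back([&node0_cpus] {
                pin_to_cpus(node0_cpus);  // keep this thread on node 0
                // ... per-node work goes here ...
            });
        for (auto& t : threads)
            t.join();
    }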

Florian Weimer