I have a multithreaded application: its main operation is the same block of code executed on independent pieces of a data structure.
Think of it as a tree where each node's operation runs independently of the others, so I create a thread for each node's operation.
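To make the structure concrete, here is a minimal sketch of the pattern. It is not the actual application code: the `Node` class, `process`, and `run` are hypothetical names, and it uses a fixed-size thread pool rather than one thread per node so that the thread count is a parameter, as in the measurements. Python is used only to show the shape of the work; because of the interpreter lock it will not reproduce the scaling behaviour of a native program.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    value: int
    children: List["Node"] = field(default_factory=list)


def flatten(root: Node) -> List[Node]:
    """Collect every node of the tree into a flat list (iterative DFS)."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def process(node: Node) -> int:
    """Placeholder for the per-node operation; touches only its own node."""
    return node.value * node.value


def run(root: Node, num_threads: int) -> Dict[int, int]:
    """Apply process() to every node using at most num_threads workers."""
    nodes = flatten(root)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(process, nodes))
    # Map each node's value to its result (values are assumed unique here).
    return {n.value: r for n, r in zip(nodes, results)}
```

Calling `run(root, n)` for increasing `n` and timing each call is the kind of measurement the graph is based on.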
I tested the performance of this code on two machines; the graph shows execution time vs. number of threads for each.
My question is: given the same code, why could such a difference happen? (Why does it saturate faster on one machine than on the other?)
Also, why does running the same code on a 48-core machine give worse results?
RED line machine specs:
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 2

BLUE line machine specs:
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Both machines have the same core speed and the same cache sizes.
Update, confirmed from the answer: I tried
numactl --cpunodebind=0 --membind=0 {exe}
to run on a single NUMA node, and the results are consistent. It was a NUMA issue.
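For anyone who wants to do the CPU half of that binding from inside the program instead of through `numactl`, here is a rough sketch, assuming Linux. `parse_cpulist` and `pin_to_numa_node` are hypothetical helper names; the sysfs path and `os.sched_setaffinity` are standard on Linux. Note that this only restricts which CPUs the process runs on. It does not bind memory the way `--membind` does, so it is not a full replacement for the command above.

```python
import os
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node: int = 0) -> Set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Calling `pin_to_numa_node(0)` before starting the worker threads keeps them all on node 0's CPUs.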