Variations in measurements of parallel code

Question

My timings vary far too much between runs of the sequential and, especially, the parallel code. For example, the sequential version takes 418 s. The parallel versions take (in seconds, five runs each):

2 threads - 250.630453 ; 339.735046 ; 256.153005 ; 256.153005 ; 311.177856

4 threads - 119.442949 ; 116.032005 ; 165.095566 ; 149.539717 ; 180.880198

8 threads - 73.856070 ; 68.082326 ; 76.318023 ; 68.922623 ; 55.321316

16 threads - 56.687378 ; 45.672769 ; 48.757555 ; 42.978104 ; 36.978891

32 threads - 24.421824 ; 21.459057 ; 23.815743 ; 24.936219 ; 24.581316

64 threads - 14.789693 ; 15.312125 ; 16.770807 ; 13.371806 ; 14.282328

The machine has 2 sockets, 32 physical cores (Intel Xeon E5-2698 v3) and Hyper-Threading. There are no other user processes running on the machine.

How normal is this? Some runs vary by more than 55%. Parallelization does affect the convergence rate of the algorithm (which is iterative), but not to this extent. In particular, I ran the very same code on another computer and it was far more stable. I haven't yet tried running other parallel codes to see how stable they are.

EDIT: I forgot to say that (1) the sequential version itself varies a lot (at least 20%) and (2) I tried all combinations of affinities, and none of them made either stability or performance consistently better.
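
To put numbers on the spread, here is a minimal Python sketch that summarizes the timings listed above. The helper names `spread` and `summarize` are made up for illustration, and the two statistics (coefficient of variation and the excess of the slowest run over the fastest) are just one reasonable way to quantify run-to-run variation, not the only one.

```python
from statistics import mean, stdev

# Wall-clock times in seconds, copied from the runs listed in the question.
runs = {
    2: [250.630453, 339.735046, 256.153005, 256.153005, 311.177856],
    4: [119.442949, 116.032005, 165.095566, 149.539717, 180.880198],
    8: [73.856070, 68.082326, 76.318023, 68.922623, 55.321316],
    16: [56.687378, 45.672769, 48.757555, 42.978104, 36.978891],
    32: [24.421824, 21.459057, 23.815743, 24.936219, 24.581316],
    64: [14.789693, 15.312125, 16.770807, 13.371806, 14.282328],
}


def spread(times):
    """Return (mean, coefficient of variation, max/min - 1) for a list of timings.

    The last two values are fractions, e.g. 0.25 means 25%.
    """
    m = mean(times)
    return m, stdev(times) / m, max(times) / min(times) - 1.0


def summarize(data):
    """Apply spread() to every thread count in the mapping."""
    return {threads: spread(times) for threads, times in data.items()}


if __name__ == "__main__":
    for threads, (m, cv, excess) in summarize(runs).items():
        print(f"{threads:2d} threads: mean {m:8.2f} s, "
              f"CV {cv:6.1%}, slowest vs fastest {excess:+6.1%}")
```

With only five runs per thread count these statistics are rough, so they are better suited to comparing configurations than to drawing firm conclusions about any single one.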

Thread migration between the CPUs affects the performance. Besides, that dual-socket system is NUMA. Use `KMP_AFFINITY` or `GOMP_CPU_AFFINITY` to bind the threads and the execution times will get much more consistent. If your compiler understands OpenMP 4.0, then set `OMP_PLACES` accordingly. — Hristo Iliev, Nov 03 '15 at 13:25
For higher execution times, with fewer threads (2), it gets even worse! 1056.12, then 703.04. — a3mlord, Nov 04 '15 at 23:17
Have you tried analysing the code with Intel VTune or similar tools? — timdykes, Nov 05 '15 at 05:17
Is this some kind of a stochastic algorithm? Monte Carlo? If the sequential version varies in its execution time, and if that is an intrinsic property of the algorithm, then the execution time of the parallel version has expectedly higher variance. — Hristo Iliev, Nov 05 '15 at 06:39
This is a randomized algorithm, but the starting seed is always the same. In sequential, it does the very same number of iterations. There is no reason to have variations in the sequential version! BTW, the code is much more stable on a different machine. — a3mlord, Nov 05 '15 at 08:21
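
The binding suggested in the comments can be tried out with a small timing harness. The sketch below is in Python and assumes an OpenMP 4.0 runtime that honors the standard `OMP_PLACES` and `OMP_PROC_BIND` variables; the function names `pinned_env` and `time_runs`, and any executable passed to them, are placeholders and not something from the question.

```python
import os
import subprocess
import time


def pinned_env(num_threads, places="cores", bind="close", base=None):
    """Copy an environment and add standard OpenMP 4.0 binding variables.

    places: "threads", "cores" or "sockets" (abstract place names).
    bind:   "close", "spread" or "master" (thread placement policy).
    base:   environment to copy; defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OMP_PLACES"] = places
    env["OMP_PROC_BIND"] = bind
    return env


def time_runs(cmd, env, repeats=5):
    """Run cmd `repeats` times and return each wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)
        times.append(time.perf_counter() - start)
    return times
```

For example, `time_runs(["./solver"], pinned_env(16, bind="spread"))` would time five pinned runs with 16 threads spread over both sockets, where `./solver` stands in for the real binary. With the Intel or GNU runtimes, `KMP_AFFINITY` or `GOMP_CPU_AFFINITY` could be set in the same dictionary instead.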
