I am seeing far too much variation between runs of the same code, both sequential and, especially, parallel. For example, the sequential version takes 418 s. The parallel versions take (five runs each, times in seconds):
2 threads - 250.630453 ; 339.735046 ; 256.153005 ; 256.153005 ; 311.177856
4 threads - 119.442949 ; 116.032005 ; 165.095566 ; 149.539717 ; 180.880198
8 threads - 73.856070 ; 68.082326 ; 76.318023 ; 68.922623 ; 55.321316
16 threads - 56.687378 ; 45.672769 ; 48.757555 ; 42.978104 ; 36.978891
32 threads - 24.421824 ; 21.459057 ; 23.815743 ; 24.936219 ; 24.581316
64 threads - 14.789693 ; 15.312125 ; 16.770807 ; 13.371806 ; 14.282328
The machine has 2 sockets and 32 physical cores (Intel Xeon E5-2698 v3) with Hyper-Threading. There are no other user processes running on the machine.
How normal is this? Some thread counts show a variation of more than 55% between runs. Running in parallel does affect the convergence rate of the algorithm (which is iterative), but not to this extent. In particular, I ran this very same code on another computer and it was much more stable. I haven't yet tried running other parallel codes to see how stable they are.
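To make the variation figure concrete, here is a minimal sketch of how it can be computed from the timings listed above. It assumes "variation" means the slowest run's excess over the fastest run, (max - min) / min; the names `timings`, `relative_spread` and `spreads` are only illustrative.

```python
# Wall-clock times in seconds, copied from the runs listed above.
timings = {
    2: [250.630453, 339.735046, 256.153005, 256.153005, 311.177856],
    4: [119.442949, 116.032005, 165.095566, 149.539717, 180.880198],
    8: [73.856070, 68.082326, 76.318023, 68.922623, 55.321316],
    16: [56.687378, 45.672769, 48.757555, 42.978104, 36.978891],
    32: [24.421824, 21.459057, 23.815743, 24.936219, 24.581316],
    64: [14.789693, 15.312125, 16.770807, 13.371806, 14.282328],
}


def relative_spread(samples):
    """Return (max - min) / min: how much slower the worst run is than the best.

    This definition of "variation" is an assumption made for illustration.
    """
    fastest = min(samples)
    return (max(samples) - fastest) / fastest


# Spread per thread count, as a fraction (0.5 means 50%).
spreads = {threads: relative_spread(runs) for threads, runs in timings.items()}

for threads, spread in spreads.items():
    print(f"{threads:2d} threads: spread = {spread:.1%}")
```

Under this definition the 4-thread runs have the largest spread, just over 55%, and the 32-thread runs have the smallest.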
EDIT: I forgot to say that (1) the sequential version itself varies a lot between runs (at least 20%) and (2) I tried all combinations of thread affinities, and none of them made either stability or performance consistently better.