I currently have two machines that produce different outputs for np.dot applied to the same pair of vectors. Without digging through the many layers of abstraction leading from NumPy down to BLAS, I was able to reproduce the discrepancy with scipy.linalg.blas.ddot, so I assume that whatever explains the discrepancy in BLAS also explains the discrepancy in NumPy. Concretely, consider the following example:
import numpy as np
from scipy.linalg.blas import ddot

u = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
v = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])

a = np.dot(v, u)                       # NumPy's dot product
b = v[0]*u[0] + v[1]*u[1] + v[2]*u[2]  # naive left-to-right sum of products
c = ddot(v, u)                         # BLAS ddot, called through SciPy

print(f'{a:.25f}')
print(f'{b:.25f}')
print(f'{c:.25f}')
This produces the following outputs:
    Machine 1                    Machine 2
a   0.3853810478481685120044631  0.3853810478481685675156143
b   0.3853810478481685120044631  0.3853810478481685120044631
c   0.3853810478481685120044631  0.3853810478481685675156143
Similarly, the following piece of Cython gives rise to the same discrepancy:
cimport scipy.linalg.cython_blas
cimport numpy as np
import numpy as np

cdef np.float64_t run_test(np.double_t[:] a, np.double_t[:] b):
    # Call the Fortran ddot directly: n elements, unit strides.
    cdef int ix, iy, n
    ix = iy = 1
    n = 3
    return scipy.linalg.cython_blas.ddot(&n, &a[0], &ix, &b[0], &iy)

a = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
b = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])

print(f'{run_test(a, b):.25f}')
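For completeness, this is roughly how I build the extension; the file name ddot_test.pyx and the setup script are my own choices, nothing about them is specific to the problem:

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

# The Cython snippet above is saved as ddot_test.pyx (hypothetical name).
ext = Extension('ddot_test', ['ddot_test.pyx'],
                include_dirs=[numpy.get_include()])
setup(ext_modules=cythonize(ext))

built with python setup.py build_ext --inplace on both machines.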
So, I'm trying to understand what could give rise to this.
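One thing worth keeping in mind is that a last-bit difference like this doesn't require either machine to be "wrong": double-precision addition is not associative, so two dot-product kernels that merely accumulate the partial products in a different order (or contract a multiply and an add into a single FMA) can legitimately round to different results. A minimal illustration, unrelated to BLAS:

# Re-associating a three-term sum changes the result in the last bit.
x, y, z = 0.1, 0.2, 0.3
print(f'{(x + y) + z:.20f}')  # 0.60000000000000008882
print(f'{x + (y + z):.20f}')  # 0.59999999999999997780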
The machines in question run Windows 10 (Intel(R) Core(TM) i7-5600U) and Windows Server 2016 (Intel(R) Xeon(R) Gold 6140) respectively.
In both cases, I have set up fresh conda environments with nothing but numpy, scipy, cython, and their dependencies. I've run checksums on the two environments to verify that the binaries included in each agree, and I've checked that the outputs of np.__config__.show() match up. Similarly, I've confirmed that mkl.get_version_string() reports the same version on the two machines.
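For reference, these are the checks I ran on both machines (mkl.get_version_string() comes from the mkl-service package that conda installs alongside MKL builds of NumPy):

import numpy as np
import scipy
import mkl  # provided by the mkl-service conda package

print(np.__version__, scipy.__version__)
np.__config__.show()             # BLAS/LAPACK build configuration
print(mkl.get_version_string())  # exact MKL runtime version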
This leads me to suspect differences in hardware. I did not look into which instructions actually end up being executed (lacking a straightforward way to debug the Cython code on Windows/MSVC), but I checked that both machines support AVX2 and FMA, which seemed like plausible sources of the discrepancy. I did find, however, that the two machines support different instruction sets. Concretely:
          Machine 1 (i7)  Machine 2 (Xeon)
AVX       Y               Y
AVX2      Y               Y
AVX512CD  N               Y
AVX512ER  N               N
AVX512F   N               Y
AVX512PF  N               N
FMA       Y               Y
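(I gathered this table from the CPU specifications, but recent NumPy builds also expose their runtime CPU feature detection, so the same information can be confirmed programmatically. This assumes a NumPy new enough to ship __cpu_features__, roughly 1.19+; note that NumPy spells FMA as FMA3, and in NumPy 2.x the module moved to numpy._core.)

# Query NumPy's runtime CPU feature detection on each machine.
from numpy.core._multiarray_umath import __cpu_features__

for feat in ('AVX', 'AVX2', 'FMA3', 'AVX512F', 'AVX512CD', 'AVX512ER', 'AVX512PF'):
    print(f'{feat:10s} {__cpu_features__.get(feat)}')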
I am, however, not aware of a good way to determine whether this difference by itself is sufficient to explain the discrepancy, or whether it's a red herring.
So my question becomes:
Starting from the above, what are some natural steps for pinning down the cause of the discrepancy? Is it time to break out the disassembler, or is there something more obvious I should check first?
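One experiment I'm considering, based on MKL's documented Conditional Numerical Reproducibility controls (I haven't verified that they cover ddot in these particular builds): force both machines onto the same dispatch branch via the MKL_CBWR environment variable and see whether the results then agree. Something like:

import os
# Must be set before MKL initializes, i.e. before importing numpy;
# equivalently, set MKL_CBWR=AVX2 in the shell before launching Python.
os.environ['MKL_CBWR'] = 'AVX2'

import numpy as np

u = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
v = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
print(f'{np.dot(v, u):.25f}')

If forcing a common code path makes the Xeon reproduce the i7's result, that would seem to confirm the instruction-set hypothesis. Would that be a sound way to test it?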