
I currently have two machines which produce different outputs for an instance of np.dot on two vectors. Without digging through the many layers of abstraction leading from NumPy to BLAS, I was able to reproduce the discrepancy in scipy.linalg.blas.ddot, so I assume an explanation of the discrepancy in BLAS also explains the discrepancy in NumPy. Concretely, consider the following example:

import numpy as np
from scipy.linalg.blas import ddot

u = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
v = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
a = np.dot(v, u)
b = v[0]*u[0] + v[1]*u[1] + v[2]*u[2]
c = ddot(v, u)
print(f'{a:.25f}')
print(f'{b:.25f}')
print(f'{c:.25f}')

This produces the following outputs:

                      Machine 1                   Machine 2
a   0.3853810478481685120044631 0.3853810478481685675156143
b   0.3853810478481685120044631 0.3853810478481685120044631
c   0.3853810478481685120044631 0.3853810478481685675156143

Similarly, the following piece of Cython gives rise to the same discrepancy:

cimport scipy.linalg.cython_blas
cimport numpy as np
import numpy as np

cdef np.float64_t run_test(np.float64_t[:] a, np.float64_t[:] b):
    # ddot is the raw Fortran routine: everything is passed by pointer,
    # here the length n = 3 and the unit strides ix = iy = 1.
    cdef int ix, iy, n
    ix = iy = 1
    n = 3
    return scipy.linalg.cython_blas.ddot(&n, &a[0], &ix, &b[0], &iy)

a = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
b = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
print(f'{run_test(a, b):.25f}')
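
For reference, a minimal build sketch for the snippet above; the filename ddot_test.pyx is hypothetical:

# Minimal, hypothetical setup.py for building the Cython snippet above,
# assumed saved as ddot_test.pyx; build with `python setup.py build_ext --inplace`.
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize('ddot_test.pyx'),
    include_dirs=[np.get_include()],  # needed for `cimport numpy`
)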

So, I'm trying to understand what could give rise to this.

The machines in question run Windows 10 (Intel(R) Core(TM) i7-5600U) and Windows Server 2016 (Intel(R) Xeon(R) Gold 6140) respectively.

In both cases, I have set up fresh conda environments with nothing but numpy, scipy, cython, and their dependencies. I've run checksums on the environments to ensure that the binaries included agree, and I've verified that the outputs of np.__config__.show() match up. Similarly, I checked that the outputs of mkl.get_version_string() agree on the two machines.
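
Concretely, the latter two checks amount to something like the following (the mkl module comes from the conda package mkl-service):

import numpy as np
import mkl  # provided by the conda package mkl-service

np.__config__.show()             # BLAS/LAPACK build configuration
print(mkl.get_version_string())  # MKL version string; agrees on both machines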

This leads me to think that the problem might lie in hardware differences. I did not look into which instructions end up being executed (lacking a straightforward way to debug the Cython code on Windows/MSVC), but I did check that both machines support AVX2/FMA, which seemed like one possible source of the discrepancy.

I did find, however, that the two machines support different instruction sets. Concretely:

          Machine 1 (i7)       Machine 2 (Xeon)
AVX                    Y                      Y
AVX2                   Y                      Y
AVX512CD               N                      Y
AVX512ER               N                      N
AVX512F                N                      Y
AVX512PF               N                      N
FMA                    Y                      Y
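
For reference, a table like the one above can be produced from Python with, e.g., the third-party py-cpuinfo package (a sketch; GCC's __builtin_cpu_supports and MSVC's __cpuidex are native alternatives):

# Sketch of one way to list the relevant CPU flags from Python; assumes the
# third-party py-cpuinfo package (`conda install py-cpuinfo`).
import cpuinfo

flags = set(cpuinfo.get_cpu_info()['flags'])
for feature in ('avx', 'avx2', 'avx512cd', 'avx512er', 'avx512f', 'avx512pf', 'fma'):
    print(f"{feature.upper():>8} {'Y' if feature in flags else 'N'}")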

I am, however, not aware of a good way to determine whether this by itself is sufficient to explain the discrepancy, or whether it is a red herring.

So my question becomes:

Starting from the above, what are some natural steps for pinning down the cause of the discrepancy? Is it time to drop down to the assembly, or is there something more obvious to check first?

  • This difference can come out of summing the products in a different order (I don't mean theoretically; summing in a different order literally produces these two different answers in this case). But that doesn't explain why that would even happen. – harold Dec 20 '19 at 16:01
  • The main path for the BLAS on the AVX-512 machine is likely doing something like an 8-in-parallel accumulation followed by a horizontal add, while the AVX2 one is probably 4-in-parallel. Obviously in this case what's going to matter is the details of the remainder handling, but the point is that it's likely two different internal routines are being invoked. If you must have bit-identical results, you'll have to somehow force the two to execute the same code path, or not rely on something like MKL that's designed to adapt to the machine it's running on. – bg2b Dec 20 '19 at 18:47
  • @bg2b: How would you go about determining if this is indeed what is happening? I'm not particularly interested in getting bit-identical results; only in understanding what is causing this. – fuglede Dec 20 '19 at 20:00
  • I haven't used MKL in a long time, but I think there are ways to disable the instruction sets that the library will use. If you can disable the AVX-512 extensions and get matching results, that would be strongly suggestive. https://software.intel.com/en-us/mkl-windows-developer-guide-instruction-set-specific-dispatching-on-intel-architectures – bg2b Dec 20 '19 at 20:12
  • Thanks, this seemed promising but I wasn't able to get matching results. I tried setting the environment variable to `AVX2` and `AVX512` on both of the machines, then recompiling and rerunning the Cython code above, but in each case, the output would be the same as if the environment variable was unset. – fuglede Dec 20 '19 at 20:26
  • Similarly, using `mkl.enable_instructions` from the conda package `mkl-service` did not change the outputs on any of the machines. – fuglede Dec 20 '19 at 20:35
  • Sorry it didn't help. I still think that something along these lines is likely the issue given harold's observation, but I can't think of an easy way to prove it. – bg2b Dec 20 '19 at 20:53
  • Just FYI, AVX512ER means you have a Xeon Phi accelerator card, not "just" a Skylake-Xeon (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). You don't mention that in your question, so people trying to reproduce this with MKL on a Skylake Xeon Gold might have MKL doing different dynamic dispatching. – Peter Cordes Dec 21 '19 at 07:04
  • @PeterCordes: Thanks for spelling this out. And just to make it perfectly clear, as I can see how the formatting of the question could be better, Machine 2, Xeon Gold 6140, is the one that _does not_ support AVX512ER. – fuglede Dec 21 '19 at 08:44
  • You have a Xeon Phi in a machine with an i7-5600U ultra-low-voltage CPU (usually used in non-gaming laptops)? What kind of machine is it? Is it some kind of machine designed around a Xeon Phi card or two, intended to run code that offloads its real work to accelerator cards? Seems like an odd choice of CPU for that. Or is it possible it's running a software emulator for AVX512? (An i7-5600U itself is Broadwell and only has AVX2/FMA.) – Peter Cordes Dec 21 '19 at 16:06
  • @PeterCordes: Sorry, good catch! Reading your comment, I double-checked my outputs, and indeed I mixed up the two machines ... whoops. I've edited the question to spell this out. The i7-5600U does support AVX512CD and AVX512F, though. (Perhaps this is unsurprising?) – fuglede Dec 22 '19 at 08:47
  • But I guess those just have no impact on MKL. Indeed, ListDLLs reveals that the i7 only uses `avx2.dll` while the Xeon uses `avx512.dll`, and given all of your comments, I take this to mean problem solved. I'll add an answer about that for anyone who comes across this, but I would definitely also accept one that goes into more detail about what these 4 different AVX512 instruction sets mean for MKL, what instructions are emitted, and what that means for our concrete number example. – fuglede Dec 22 '19 at 08:57
  • i7-5600U is Broadwell, aka 5th gen. It does *not* support anything beyond AVX2 + 256-bit FMA. It's a plain vanilla laptop chip from a few years ago (Q1 2015), when AVX512 was still just on the horizon, not present in any released CPUs. https://ark.intel.com/content/www/us/en/ark/products/85215/intel-core-i7-5600u-processor-4m-cache-up-to-3-20-ghz.html. A Skylake-Xeon will support AVX512CD (but not ER), and of course AVX512F (Foundation: any CPU with any AVX512 will have AVX512F). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 – Peter Cordes Dec 22 '19 at 13:02
  • As your answer shows, MKL runtime dispatching doesn't use any AVX512 libraries on your Broadwell machine. If you're using something else for CPU feature detection, it's broken or you're using it wrong. – Peter Cordes Dec 22 '19 at 13:05
  • @PeterCordes: That's quite curious. I was following https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?redirectedfrom=MSDN&view=vs-2019 verbatim when reporting the above. If I use GCC instead, indeed `__builtin_cpu_supports("avx512cd")` is false. – fuglede Dec 22 '19 at 15:39
  • Okay, once again, my post-processing of the results must simply have failed; on rerunning the tests, my results match your expected instruction set support exactly. I appreciate the persistence and your spelling out the expected results. I've updated the question accordingly. – fuglede Dec 22 '19 at 15:45
  • Yup, you probably were checking the wrong bit in the CPUID bitmap, or queried the wrong leaf (input EAX and sometimes ECX). Hand-rolled CPUID checking bugs explains all the previous nonsense. :P – Peter Cordes Dec 22 '19 at 21:48
  • @PeterCordes: Good guesses and I wish they were correct but I'm afraid in reality I was dumber: I had simply flipped all my Ys and Ns. – fuglede Dec 23 '19 at 09:53
  • @bg2b: I played around with it a bit more; while I still have a hard time configuring `MKL_ENABLE_INSTRUCTIONS` at runtime in the Cython script, if I just set the environment variable prior to launching Python/IPython (and then just use `scipy.linalg.blas.ddot` in ordinary Python), then I do indeed get the result you expected: the AVX512 machine can be forced to produce results matching the AVX2 machine. So that's all great. – fuglede Dec 24 '19 at 09:42
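
To make the experiment from the last comment concrete, a minimal sketch; it assumes MKL reads MKL_ENABLE_INSTRUCTIONS when its DLL first initializes, so that setting the variable before the first numpy import (or in the shell before launching Python, which is what worked in practice above) is what counts:

# Sketch of forcing MKL's dispatch, per the comment thread above.
# Assumption: MKL_ENABLE_INSTRUCTIONS is read when the MKL DLL initializes,
# i.e. it must be set before the first `import numpy`.
import os
os.environ['MKL_ENABLE_INSTRUCTIONS'] = 'AVX2'

import numpy as np
from scipy.linalg.blas import ddot

u = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
v = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
print(f'{ddot(v, u):.25f}')  # with AVX2 forced, both machines should agree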

2 Answers

Given the excellent comments on the question, it seems clear that the difference in supported instruction sets is ultimately the culprit. Indeed, running ListDLLs while the Cython script executes shows that MKL loads different libraries on the two machines.

For the i7 (machine 1):

>listdlls64 python.exe | wsl grep mkl
0x00000000b9ff0000  0xe7e000   [...]\miniconda3\envs\npfloattest\Library\bin\mkl_rt.dll
0x00000000b80e0000  0x1f05000  [...]\miniconda3\envs\npfloattest\Library\bin\mkl_intel_thread.dll
0x00000000b3b40000  0x43ba000  [...]\miniconda3\envs\npfloattest\Library\bin\mkl_core.dll
0x00000000b0e50000  0x2ce5000  [...]\miniconda3\envs\npfloattest\Library\bin\mkl_avx2.dll
0x00000000b01f0000  0xc58000   [...]\miniconda3\envs\npfloattest\Library\bin\mkl_vml_avx2.dll
0x00000000f88c0000  0x7000     [...]\miniconda3\envs\npfloattest\lib\site-packages\mkl\_mklinit.cp37-win_amd64.pyd
0x00000000afce0000  0x22000    [...]\miniconda3\envs\npfloattest\lib\site-packages\mkl\_py_mkl_service.cp37-win_amd64.pyd

For the Xeon (machine 2):

0x0000000057ec0000  0xe7e000   [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_rt.dll
0x0000000055fb0000  0x1f05000  [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_intel_thread.dll
0x0000000051bf0000  0x43ba000  [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_core.dll
0x000000004e1a0000  0x3a4a000  [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_avx512.dll
0x000000005c6c0000  0xc03000   [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_vml_avx512.dll
0x0000000079a70000  0x7000     [...]\Miniconda3\envs\npfloattest\lib\site-packages\mkl\_mklinit.cp37-win_amd64.pyd
0x000000005e830000  0x22000    [...]\Miniconda3\envs\npfloattest\lib\site-packages\mkl\_py_mkl_service.cp37-win_amd64.pyd

This very strongly indicates that the support of AVX512CD/AVX512F is enough to prompt MKL to use a different library, and thereby, presumably, ultimately a different set of instructions.
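
The same check can be made from inside the running process itself; a sketch, assuming the third-party psutil package:

# List the MKL DLLs mapped into the current process; assumes psutil.
import psutil
import numpy as np

np.dot(np.ones(3), np.ones(3))  # make sure the compute kernels are loaded
for m in psutil.Process().memory_maps():
    if 'mkl' in m.path.lower():
        print(m.path)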

Now it's interesting to see how this actually unfolds: What instructions are emitted, and what that means on the concrete number example.

As a start, let us write an equivalent VC++ program to get an idea of which instructions end up being executed:

#include <iostream>
#include <windows.h>

typedef double (*func)(int, const double*, int, const double*, int);

int main()
{
    double a[3];
    double b[3];
    std::cin >> a[0] >> a[1] >> a[2];
    std::cin >> b[0] >> b[1] >> b[2];

    // Resolve cblas_ddot at runtime from MKL's single dynamic library.
    HINSTANCE rt = LoadLibrary(TEXT("mkl_rt.dll"));
    func cblas_ddot = (func)GetProcAddress(rt, "cblas_ddot");
    double res_rt = cblas_ddot(3, a, 1, b, 1);
    std::cout.precision(25);
    std::cout << res_rt;
}
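
For those who prefer staying in Python, roughly the same experiment can be driven via ctypes (a sketch; it assumes mkl_rt.dll can be found on the DLL search path):

# Call MKL's cblas_ddot directly from Python; assumes mkl_rt.dll is locatable.
import ctypes

mkl = ctypes.CDLL('mkl_rt.dll')
mkl.cblas_ddot.restype = ctypes.c_double
mkl.cblas_ddot.argtypes = [
    ctypes.c_int, ctypes.POINTER(ctypes.c_double), ctypes.c_int,
    ctypes.POINTER(ctypes.c_double), ctypes.c_int,
]

Vec = ctypes.c_double * 3
a = Vec(0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170)
b = Vec(-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354)
print(f'{mkl.cblas_ddot(3, a, 1, b, 1):.25f}')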

Let us try to run this on each machine using Visual Studio's assembly debugger, starting with the i7 (machine 1), i.e. the machine supporting only AVX2. In each case we note all YMM registers modified by an instruction: for example, YMM4 and YMM5 are initialized with the values of a and b respectively; after the vfmadd231pd, YMM3 contains the element-wise product of the two arrays; and after the vaddsd, the lower part of YMM5 contains the result:

vmaskmovpd  ymm4,ymm5,ymmword ptr [rbx]
YMM4 = 0000000000000000-BFEF9C656BB84218-BFB3E64ABC939CC1-3FC13BC946A68994 
vmaskmovpd  ymm5,ymm5,ymmword ptr [r9]
YMM5 = 0000000000000000-BFE026E9B3AD5464-BFA3057691D85EDE-BFEB9951B813250D 
vfmadd231pd ymm3,ymm5,ymm4
YMM3 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd      ymm1,ymm3,ymm1  
YMM1 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC 
vaddpd      ymm0,ymm2,ymm0  
vaddpd      ymm2,ymm1,ymm0
YMM2 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC 
vhaddpd     ymm3,ymm2,ymm2
YMM3 = 3FDFE946951928C9-3FDFE946951928C9-BFBCFCC53F6313B2-BFBCFCC53F6313B2 
vperm2f128  ymm4,ymm3,ymm3,1
YMM4 = BFBCFCC53F6313B2-BFBCFCC53F6313B2-3FDFE946951928C9-3FDFE946951928C9 
vaddsd      xmm5,xmm3,xmm4
YMM5 = 0000000000000000-0000000000000000-BFBCFCC53F6313B2-3FD8AA15454063DC
vmovsd      qword ptr [rsp+90h],xmm5 

The same experiment on machine 2, the one supporting AVX-512, gives the following result (where we give only the lower half of the ZMM registers):

vmovupd     zmm5{k1}{z},zmmword ptr [r12]
ZMM5 = 0000000000000000-BFEF9C656BB84218-BFB3E64ABC939CC1-3FC13BC946A68994 
vmovupd     zmm4{k1}{z},zmmword ptr [r9]  
ZMM4 = 0000000000000000-BFE026E9B3AD5464-BFA3057691D85EDE-BFEB9951B813250D 
vfmadd231pd zmm3,zmm4,zmm5
ZMM3 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd      zmm17,zmm1,zmm0  
mov         eax,0F0h  
kmovw       k1,eax  
vaddpd      zmm16,zmm3,zmm2
ZMM16= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd      zmm19,zmm16,zmm17 
ZMM19= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC 
mov         eax,0Ch  
kmovw       k2,eax  
vcompresspd zmm18{k1}{z},zmm19  
vaddpd      zmm21,zmm18,zmm19 
ZMM21= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vcompresspd zmm20{k2}{z},zmm21
ZMM20= 0000000000000000-0000000000000000-0000000000000000-3FDFE946951928C9
vaddpd      zmm0,zmm20,zmm21
ZMM0 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-3FD87AC4BCE238CE 
vhaddpd     xmm1,xmm0,xmm0
ZMM1 = 0000000000000000-0000000000000000-3FD8AA15454063DD-3FD8AA15454063DD 
vmovsd      qword ptr [rsp+88h],xmm1

Comparing the two, we first note that the discrepancy is a single bit: 3FD8AA15454063DC vs. 3FD8AA15454063DD. But we now also see how it arises: in the AVX2 case, the horizontal add acts on what corresponds to the 0th and 1st entries of the vectors, while in the AVX-512 case it acts on the 0th and 2nd entries. That is, the discrepancy simply boils down to the difference between naively computing v[0]*u[0] + v[2]*u[2] + v[1]*u[1] and v[0]*u[0] + v[1]*u[1] + v[2]*u[2]. Indeed, comparing the two, we find the exact same discrepancy:

In [34]: '%.25f' % (v[0]*u[0] + v[2]*u[2] + v[1]*u[1])
Out[34]: '0.3853810478481685675156143'

In [35]: '%.25f' % (v[0]*u[0] + v[1]*u[1] + v[2]*u[2])
Out[35]: '0.3853810478481685120044631'
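
As a final sanity check, the hex lanes in the register dumps above can be decoded back to doubles with the standard struct module:

# Decode the IEEE 754 bit patterns from the register dumps (big-endian hex).
import struct

def hex_to_double(h):
    return struct.unpack('>d', bytes.fromhex(h))[0]

print(f"{hex_to_double('3FD8AA15454063DC'):.25f}")  # the AVX2 result
print(f"{hex_to_double('3FD8AA15454063DD'):.25f}")  # the AVX-512 result, one ULP above
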
– fuglede
  1. If you need bitwise-equal answers: did you try the MKL_ENABLE_INSTRUCTIONS variable from the link posted by @bg2b (https://software.intel.com/en-us/mkl-linux-developer-guide-instruction-set-specific-dispatching-on-intel-architectures)? If you import the MKL library first and only then call mkl.enable_instructions, it might be too late.
  2. In the double-precision (DP) world: the relative difference is -1.4404224470807435333001684021155e-16 (the absolute difference is -5.55111512e-17), which is less than the C++ and Python DP machine epsilon (https://en.wikipedia.org/wiki/Machine_epsilon); see the sketch below. So from the point of view of double-precision accuracy, the results are equal.
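
A quick check of the figures in point 2:

# The one-ULP difference is below double-precision machine epsilon
# relative to the result.
import sys

a = 0.3853810478481685675156143  # AVX-512 result
b = 0.3853810478481685120044631  # AVX2 result
print(b - a)                   # ~ -5.55e-17 absolute
print((b - a) / a)             # ~ -1.44e-16 relative
print(sys.float_info.epsilon)  # 2.220446049250313e-16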

Cheers, Vladimir

  • Cf. the comments on the original question: I did play around with `MKL_ENABLE_INSTRUCTIONS` but was unable to use it to change the results of the Cython program. Your answer prompted me to test again, though, and indeed I _was_ able to use it to change the results of the original Python script. So the exact order matters, and the variable can indeed be used to obtain reproducible answers. – fuglede Dec 24 '19 at 09:44