import numpy as np

N = 4
m = 2
mat = np.random.rand(N, 2*m)  # N x 2m matrix
vec = np.random.rand(N)       # N-dimensional vector
# Full dot product vs. the same product computed block by block.
dot1 = np.dot(vec, mat)
dot2 = np.concatenate([np.dot(vec, mat[:, :m]), np.dot(vec, mat[:, m:])])
print('Max difference:', np.max(np.abs(dot2 - dot1)))
Output (10 sample runs):
Max difference: 1.1102230246251565e-16
Max difference: 0.0
Max difference: 0.0
Max difference: 0.0
Max difference: 1.1102230246251565e-16
Max difference: 0.0
Max difference: 1.1102230246251565e-16
Max difference: 2.220446049250313e-16
Max difference: 1.1102230246251565e-16
Max difference: 1.1102230246251565e-16
obtained by running on a MacBook Pro with:
Python version 3.9.1 (v3.9.1:1e5d33e9b9, Dec 7 2020, 12:10:52)
[Clang 6.0 (clang-600.0.57)]
Numpy version 1.25.1
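(For reproducibility, the version lines above can be printed with something like the following.)

import sys
import numpy as np

print('Python version', sys.version)
print('Numpy version', np.__version__)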
I understand that the exact order of additions may depend on the dimensions (N, m) of the matrix, but I would like to know how exactly it depends, preferably with pointers to the NumPy source code.
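For context, here is a minimal sketch (my own illustration, not NumPy's actual code) of why the grouping matters at all: IEEE 754 addition is not associative, so sequential and pairwise summation of the same four products can disagree in the last bit.

import numpy as np

rng = np.random.default_rng(0)
diffs = 0
for _ in range(1000):
    p = rng.random(4) * rng.random(4)     # four products, as in one output entry
    seq = ((p[0] + p[1]) + p[2]) + p[3]   # strictly left-to-right
    pair = (p[0] + p[1]) + (p[2] + p[3])  # pairwise, as vectorized code often does
    diffs += seq != pair
print(diffs, 'of 1000 random draws give different sums')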
I can reproduce this on a Linux system as well:
Python version 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0]
Numpy version 1.17.4
But strangely, on a Linux HPC cluster, the same code always yields 0.0:
Python version 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29)
[GCC 10.4.0]
Numpy version 1.24.4
I have to change the datatype of the arrays and enlarge the dimensions to make dot1 and dot2 differ:
import numpy as np

N = 12
m = 8
mat = np.random.rand(N, 2*m).astype('float32')  # single precision
vec = np.random.rand(N).astype('float32')
dot1 = np.dot(vec, mat)
dot2 = np.concatenate([np.dot(vec, mat[:, :m]), np.dot(vec, mat[:, m:])])
print('Max difference:', np.max(np.abs(dot2 - dot1)))
Output (5 sample runs):
Max difference: 2.3841858e-07
Max difference: 4.7683716e-07
Max difference: 4.7683716e-07
Max difference: 2.3841858e-07
Max difference: 2.3841858e-07
But whether this works depends on which node of the HPC cluster it runs on! On one node I got the above results, but on another node I still get all 0.0. I guess it might depend on the specific CPU's implementation of vectorization? Indeed, the node that produces different dot1 and dot2 has an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, while the node that produces identical dot1 and dot2 has an Intel(R) Xeon(R) CPU E7-8860 v4 @ 2.20GHz.
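One way to probe this (a diagnostic sketch; __cpu_features__ is a private, version-dependent NumPy attribute, so it may not exist on every installation) is to compare the BLAS build and the runtime-detected SIMD features on the two nodes:

import numpy as np

np.show_config()  # which BLAS the install was built against (OpenBLAS, MKL, ...)

# Runtime-detected CPU features used for dispatch (private API, recent NumPy only).
from numpy.core._multiarray_umath import __cpu_features__
print(sorted(k for k, v in __cpu_features__.items() if v))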
Update July 20, 2023: Here are my three main questions:
- Suppose that np.dot treats each column of mat equally and applies the same order of additions, say (v[0]*mat[0,i] + v[1]*mat[1,i]) + (v[2]*mat[2,i] + v[3]*mat[3,i]); then it shouldn't matter whether the column sits in an N x 2m matrix or in an N x m matrix, and we shouldn't observe any difference in the results. Therefore, np.dot appears to use a different order of additions depending on the dimensions of the matrix (and possibly a different order for different columns of the same matrix), and I want to understand why it needs to do so. Maybe accessing memory in a certain order speeds things up? Unlike matrix multiplication, I don't think the dot product has any fancy algorithm beyond taking products and summing them up, right? Apparently, the numpy repo contains 36.8% C/C++ code, so I think it may contain some of the low-level code responsible for this behavior, and it would be wonderful if an expert could help me navigate it.
- In the comments it's pointed out that 1e-16 is a small error, but of course the error gets bigger as the dimension of the vector grows. I performed experiments in this notebook where I scale the dimension by powers of 2, 3, and 10 (a simplified version is sketched after this list). Weirdly, the error appears to grow super-linearly in the dimension (see the red markers). However, if I exchange the roles of rows and columns (see the function run_row_exp), then it grows sub-linearly. What is causing this asymmetry? (Notice that the "row" version runs faster because it doesn't need to copy part of the matrix when forming mat[:m] or mat[m:].)
- As explained earlier, different results were observed on two different nodes of the same cluster, with different CPUs but the same software installed. Do you think that the BLAS routines underlying np.dot were compiled to different binaries on the two nodes? That is, did the compiler detect which hardware it ran on and select different subroutines? Or perhaps multiple subroutines are all compiled, and NumPy selects which one to use when the Python code runs?
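For the second question, this is roughly the scaling experiment, simplified from the notebook (the sizes here are my choice for illustration, not the notebook's exact parameters):

import numpy as np

for N in [4, 16, 64, 256, 1024]:
    m = N // 2
    mat = np.random.rand(N, 2 * m)
    vec = np.random.rand(N)
    dot1 = np.dot(vec, mat)
    dot2 = np.concatenate([np.dot(vec, mat[:, :m]), np.dot(vec, mat[:, m:])])
    print(N, np.max(np.abs(dot2 - dot1)))  # how the max error scales with N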
Update July 21, 2023: I made an attempt to find out which orders of addition are actually used in the original example (i.e., the dot product of a 4D vector with either a 4x4 matrix or a 4x2 matrix), using the code in this notebook. Apparently, for 4x2 matrices the order (0+1)+(2+3) is always used, but to my surprise, for 4x4 matrices none of the orders passes 20 randomly generated test cases. (I originally ran the code on my MacBook, but the result can be replicated exactly on Colab using the same matrices and vectors.) I suspect that some 4D dot-product instruction might have been used, but wouldn't such an instruction still break down into a series of binary addition and multiplication operations? How could the results match none of the addition orders? And I also don't see a reason to use such an instruction in the 4x4 case but not in the 4x2 case.
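For reference, here is a stripped-down version of the kind of check the notebook performs, for a single output entry. The set of groupings below is only a sample, not exhaustive; and note one caveat: if the library uses fused multiply-add instructions, the products are never rounded individually, so no pure sum of rounded products would match.

import numpy as np

rng = np.random.default_rng(0)
v = rng.random(4)
M = rng.random((4, 4))
ref = np.dot(v, M)[0]  # np.dot's result for the first column
p = v * M[:, 0]        # the four individually rounded products

# IEEE addition is exactly commutative, so only the grouping
# (which elements are summed together) can change the result.
candidates = {
    '((0+1)+2)+3': ((p[0] + p[1]) + p[2]) + p[3],
    '(0+1)+(2+3)': (p[0] + p[1]) + (p[2] + p[3]),
    '(0+2)+(1+3)': (p[0] + p[2]) + (p[1] + p[3]),
    '(0+3)+(1+2)': (p[0] + p[3]) + (p[1] + p[2]),
}
for name, val in candidates.items():
    print(name, val == ref)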