import numpy as np

N = 4
m = 2
mat = np.random.rand(N,2*m)
vec = np.random.rand(N)
dot1 = np.dot(vec,mat)
dot2 = np.concatenate([np.dot(vec,mat[:,:m]), np.dot(vec,mat[:,m:])])
print('Max difference:', np.max(np.abs(dot2-dot1)))

Output (10 sample runs):

Max difference: 1.1102230246251565e-16
Max difference: 0.0
Max difference: 0.0
Max difference: 0.0
Max difference: 1.1102230246251565e-16
Max difference: 0.0
Max difference: 1.1102230246251565e-16
Max difference: 2.220446049250313e-16
Max difference: 1.1102230246251565e-16
Max difference: 1.1102230246251565e-16

obtained by running on a MacBook Pro with:

Python version 3.9.1 (v3.9.1:1e5d33e9b9, Dec  7 2020, 12:10:52) 
[Clang 6.0 (clang-600.0.57)]
Numpy version 1.25.1

I understand that the exact order of additions may depend on the dimensions (N, m) of the matrix, but I would like to know exactly how it depends on them, preferably with pointers to the NumPy source code.
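
For reference, differences of this size are exactly what floating-point non-associativity produces, so any change in the grouping of the partial sums can already change the last bit. A scalar illustration:

# Floating-point addition is not associative, so regrouping a sum
# can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))        # False
print(abs((a + b) + c - (a + (b + c))))  # 1.1102230246251565e-16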

I can reproduce this on a Linux system as well:

Python version 3.8.10 (default, May 26 2023, 14:05:08) 
[GCC 9.4.0]
Numpy version 1.17.4

But strangely, on a Linux HPC cluster, the same code always yields 0.0:

Python version 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29) 
[GCC 10.4.0]
Numpy version 1.24.4

I have to change the datatype of the arrays to float32 and enlarge the dimensions to make dot1 and dot2 differ:

import numpy as np

N = 12
m = 8
mat = np.random.rand(N,2*m).astype('float32')
vec = np.random.rand(N).astype('float32')
dot1 = np.dot(vec,mat)
dot2 = np.concatenate([np.dot(vec,mat[:,:m]), np.dot(vec,mat[:,m:])])
print('Max difference:', np.max(np.abs(dot2-dot1)))

Output (5 sample runs):

Max difference: 2.3841858e-07
Max difference: 4.7683716e-07
Max difference: 4.7683716e-07
Max difference: 2.3841858e-07
Max difference: 2.3841858e-07

But whether this works depends on which node (of the HPC cluster) it runs on! On one node I got the above results, but on another node I still get all 0.0. I guess it might depend on the specific CPU's implementation of vectorization? Indeed, the node that produces different dot1 and dot2 has an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, while the node that produces the same dot1 and dot2 has an Intel(R) Xeon(R) CPU E7-8860 v4 @ 2.20GHz.
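
To check this guess, one can look at what NumPy itself reports about the BLAS it was built against and the SIMD features it detected at runtime. Something along these lines (note that __cpu_features__ is a private attribute and, as far as I know, only exists in reasonably recent NumPy versions):

import numpy as np

# Show the BLAS/LAPACK libraries NumPy was built against.
np.show_config()

# NumPy's runtime CPU dispatcher records which SIMD features it detected.
# This attribute is private and may not exist in older NumPy versions.
try:
    from numpy.core._multiarray_umath import __cpu_features__
    print('SIMD features detected:',
          sorted(name for name, on in __cpu_features__.items() if on))
except ImportError:
    print('__cpu_features__ not available in this NumPy version')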

Update July 20, 2023: Here are my three main questions:

  1. Suppose that np.dot treats each column of mat equally and applies the same order of additions, say (v[0]*mat[0,i]+v[1]*mat[1,i]) + (v[2]*mat[2,i] + v[3]*mat[3,i]); then it shouldn't matter whether the column sits in an N x 2m matrix or in an N x m matrix, and we shouldn't observe any difference in the results. Therefore, np.dot appears to use a different order of additions depending on the dimensions of the matrix (and possibly a different order for different columns of the same matrix), and I want to understand why it needs to do so. Maybe accessing memory in a certain order speeds things up? Unlike matrix multiplication, I don't think the dot product has any fancy algorithm beyond taking products and summing them up, right? Apparently, the numpy repo contains 36.8% C/C++ code, so I think it may contain some of the low-level code responsible for this behavior, and it would be wonderful if an expert could help me navigate it.

  2. In the comments it's pointed out that 1e-16 is a small error, but of course the error gets bigger as the dimension of the vector grows. I performed experiments in this notebook where I scale the dimension by powers of 2, 3, and 10 (a rough sketch of the experiment is given after this list). Weirdly, the error appears to grow super-linearly in the dimension (see the red markers). However, if I exchange the roles of rows and columns (see the function run_row_exp), it grows sub-linearly. What is causing this asymmetry? (Notice that the "row" version runs faster because it doesn't need to copy part of the matrix when forming mat[:m] or mat[m:].)

[Figure: maximum difference vs. dimension (scaled by powers of 2, 3, and 10); the column version (red markers) grows super-linearly, the row version sub-linearly.]

  3. As explained earlier, different results were observed on two different nodes of the same cluster, with different CPUs but the same software installed. Do you think the BLAS routines underlying np.dot were compiled into different binaries on the two nodes? That is, did the compiler detect which hardware it ran on and select different subroutines? Or are multiple subroutines all compiled, with NumPy selecting which one to use when the Python code is run?
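
For reference, the dimension-scaling experiment from question 2 is roughly the following (a simplified reconstruction, not the notebook code itself; run_row_exp mirrors the notebook's name, while run_col_exp is just my name here for the original column version):

import numpy as np

def run_col_exp(N, m):
    # "Column" version: contract over axis 0; mat[:, :m] and mat[:, m:] are copies.
    mat = np.random.rand(N, 2 * m)
    vec = np.random.rand(N)
    dot1 = np.dot(vec, mat)
    dot2 = np.concatenate([np.dot(vec, mat[:, :m]), np.dot(vec, mat[:, m:])])
    return np.max(np.abs(dot2 - dot1))

def run_row_exp(N, m):
    # "Row" version: roles of rows and columns exchanged; mat[:m] and mat[m:]
    # are views, so no copy is made.
    mat = np.random.rand(2 * m, N)
    vec = np.random.rand(N)
    dot1 = np.dot(mat, vec)
    dot2 = np.concatenate([np.dot(mat[:m], vec), np.dot(mat[m:], vec)])
    return np.max(np.abs(dot2 - dot1))

for k in range(2, 12):
    N = 2 ** k                     # scale the contracted dimension by powers of 2
    print(N, run_col_exp(N, N // 2), run_row_exp(N, N // 2))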

Update July 21, 2023: I made an attempt to find out which orders of addition are actually used in the original example (i.e. the dot product of a 4D vector with either a 4x4 matrix or a 4x2 matrix), using the code in this notebook (sketched below). Apparently, for 4x2 matrices the order (0+1)+(2+3) is always used, but to my surprise, for 4x4 matrices none of the orders passes 20 randomly generated test cases. (I originally ran the code on my MacBook, but the result can be replicated exactly on Colab using the same matrices and vectors.) I suspect that some 4D dot-product instruction might have been used, but wouldn't such instructions still break down into a series of binary addition and multiplication operations? How could the results match none of the addition orders? And I also don't see a reason to use such an instruction in the 4x4 case but not in the 4x2 case.
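
The check is roughly of the following shape: for each output entry, form the four products v[i]*mat[i,j] explicitly and compare every candidate summation order bit-for-bit against np.dot (again a simplified sketch, not the exact notebook code):

import numpy as np
from itertools import permutations

def candidate_sums(p):
    # All orders of summing four products obtained by permuting the terms
    # and using either a sequential or a pairwise grouping.
    sums = {}
    for perm in permutations(range(4)):
        a, b, c, d = (p[i] for i in perm)
        sums[f'(({perm[0]}+{perm[1]})+{perm[2]})+{perm[3]}'] = ((a + b) + c) + d
        sums[f'({perm[0]}+{perm[1]})+({perm[2]}+{perm[3]})'] = (a + b) + (c + d)
    return sums

rng = np.random.default_rng(0)
for cols in (2, 4):                          # 4x2 block vs. full 4x4 matrix
    mat = rng.random((4, cols))
    vec = rng.random(4)
    ref = np.dot(vec, mat)
    for j in range(cols):
        p = vec * mat[:, j]                  # the four elementwise products
        hits = [name for name, s in candidate_sums(p).items() if s == ref[j]]
        print(f'{cols} columns, column {j}: matching orders {hits}')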

  • Differences on the order of `1e-16` are considered close by `np.isclose` for floats. Where possible, `dot` is performed with optimized `BLAS` routines, so the details are not readily available to you (a Python-level programmer). `dot` involves a sum of products, so it can be sensitive to the order in which those sums are taken, especially if the "subsequent" products nearly cancel each other out. – hpaulj Jul 17 '23 at 20:18
  • https://stackoverflow.com/questions/76551058/comparing-accurrate-dot-product-with-mpmath – `dot` can be calculated as a sum of products. In some cases, using `math.fsum` can improve on that calculation by taking a 'smarter' sum, but speed will suffer. – hpaulj Jul 17 '23 at 20:31
  • Thanks for the comments. I just edited to make my questions explicit and add new data; can you take another look? – Junyan Xu Jul 20 '23 at 21:38
  • Can you post the output of `np.show_config()` for both platforms? Should show what BLAS libraries are in use. – Nick ODell Jul 20 '23 at 23:17
  • @NickODell I'm temporarily unable to get allocated nodes with the Intel CPUs in the question, so here is the output on a node with AMD CPU that also always yields 0.0: https://gist.github.com/alreadydone/30d7f109e6c36381aab8565f808e0018 – Junyan Xu Jul 21 '23 at 02:33
  • @NickODell I've updated the gist with complete np.show_config output with various CPUs. It seems nonzero difference in the float32 version may be attributed to the availability of AVX512F,AVX512CD,AVX512_SKX. The different behaviors of the float64 version on Colab/MacBook vs. the HPC cluster may be due to different BLAS/LAPACK libraries (openblas64__info/blas_ilp64_opt_info/openblas64__lapack_info/lapack_ilp64_opt_info on Colab, blas_info/blas_opt_info/lapack_info/lapack_opt_info on HPC cluster). – Junyan Xu Jul 21 '23 at 16:51
  • It could be that vectorization is changing what order floating point operations happen in. For example, AVX512 can be used to add and multiply up to 16 different values in parallel. Then at the end, it adds up all the individual values in that vector register. So you have one value in the register which represents all sums from the matrix in positions 16k + 1, one which represents all sums from 16k + 2, etc. Since floating point is non-associative, changing the order can change the output. – Nick ODell Jul 21 '23 at 17:06
  • Thanks! I tried to figure out which order of addition is used in each case, but surprisingly found that in the 4x4 case none of the orders matches. Please see the latest update to the main question. – Junyan Xu Jul 21 '23 at 20:13

0 Answers