
Suppose I have two arrays of type int8. I want to use einsum on them in such a way that all the calculations are done as int64, but I don't want to convert the whole arrays to int64. If I understand correctly, this is what the dtype argument is for, but it seems it does not always work.

An example where it does work as expected:

>>> A = np.array([[123, 45],[67,89]], dtype='int8')
>>> np.einsum(A,[0,1],A,[0,1],[1]) ## => integer overflow
array([-94, -38], dtype=int8)
>>> np.einsum(A,[0,1],A,[0,1],[1], dtype='int64') ## => no overflow
array([19618,  9946], dtype=int64)

An example where it does not work as expected:

>>> A = np.array([[123, 45],[67,89]], dtype='int8')
>>> np.einsum(A,[0,1],A,[1,2],[0,2]) ## => integer overflow
array([[-32,  68],
       [124, -72]], dtype=int8)
>>> np.einsum(A,[0,1],A,[1,2],[0,2], dtype='int64') ## => should not overflow, but it does
array([[-32,  68],
       [124, -72]], dtype=int8)

For these examples I used Python 3.6.4 on Windows with NumPy 1.14.0.

The same happens when I try other dtypes, e.g., float64, and even if I use casting='unsafe'.
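
For concreteness, this is the kind of call I mean; on my setup it again returns the overflowed int8 result:

>>> np.einsum(A,[0,1],A,[1,2],[0,2], dtype='float64', casting='unsafe') ## => same overflow
array([[-32,  68],
       [124, -72]], dtype=int8)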

Why does this happen, and how can I make it work?



UPDATE:
einsum has optimize=True as its default, at least in NumPy 1.14.0. With optimize=False it works as expected (although much slower for large arrays):

>>> A = np.array([[123, 45],[67,89]], dtype='int8')
>>> np.einsum(A,[0,1],A,[1,2],[0,2]) ## => integer overflow
array([[-32,  68],
       [124, -72]], dtype=int8)
>>> np.einsum(A,[0,1],A,[1,2],[0,2], dtype='int64', optimize=False)
array([[18144,  9540],
       [14204, 10936]], dtype=int64)


From a brief look at einsum.py, it seems that with optimize=True it checks whether it would be better to use numpy.tensordot instead of the einsum implementation. If so (as it should be in my 2nd example, which is just a regular matrix multiplication), it uses tensordot but does not pass the dtype argument along. Indeed, tensordot does not even have a dtype argument.
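
A quick check that is consistent with this explanation (assuming the optimized path really does dispatch to tensordot here): calling tensordot directly on the int8 arrays reproduces exactly the overflowed result from above, since the underlying dot keeps the int8 dtype:

>>> np.tensordot(A, A, axes=([1],[0])) ## int8 in, int8 out -- same wrap-around as the optimized einsum
array([[-32,  68],
       [124, -72]], dtype=int8)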


If this is correct, then a follow-up question is required (which is probably worth its own post?):
How can I matrix-multiply two matrices of a certain dtype, say int8, so that all the calculations are done as, say, int64 or float64 (and therefore will not overflow or lose precision, unless an int64/float64 would too), without converting them to the desired type first? Moreover, the operation's implementation itself should not convert the matrices as a whole, only small parts at a time, so that the memory required for the operation is not much larger than the memory needed just to hold these matrices and the result.
Can this be done with efficiency comparable to numpy.dot?
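
To make "only small parts each time" concrete, here is a naive sketch of the kind of blocked multiplication I have in mind (the name blocked_matmul and the block size 256 are just placeholders; this pure-Python loop is of course nowhere near the speed I am asking about):

>>> def blocked_matmul(a, b, out_dtype='int64', block=256):
...     """a @ b, upcasting only one small tile of each input at a time."""
...     out = np.zeros((a.shape[0], b.shape[1]), dtype=out_dtype)
...     for i in range(0, a.shape[0], block):
...         for j in range(0, b.shape[1], block):
...             for k in range(0, a.shape[1], block):
...                 # only these block-sized tiles are converted to out_dtype
...                 out[i:i+block, j:j+block] += (
...                     a[i:i+block, k:k+block].astype(out_dtype) @
...                     b[k:k+block, j:j+block].astype(out_dtype))
...     return out
...
>>> blocked_matmul(A, A)
array([[18144,  9540],
       [14204, 10936]], dtype=int64)

Something along these lines never holds more than a few block-sized tiles in the larger dtype, but it obviously cannot compete with a BLAS-backed numpy.dot, which is exactly the gap my question is about.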

ea1
  • FWIW I was unable to reproduce this on Debian Linux with either NumPy 1.13.0 or 1.14.2 (using Python 3.6.0). Strange that it should fail for some setups - hopefully there's a workaround, or an easy bugfix if needs be. If asking here doesn't yield an explanation, you could try opening an issue with [NumPy on GitHub](https://github.com/numpy/numpy/issues). – Alex Riley Apr 01 '18 at 10:17
  • I also cannot reproduce this. I'm using NumPy 1.13.3 on Python 3.6.3 on OS X (x86-64). – John Zwinck Apr 01 '18 at 10:38
  • That `optimize=True` default was a controversial change in 1.14.?. Chunking will be awkward for `dot` products, since each result element uses a whole row and a whole column. – hpaulj Apr 01 '18 at 16:10
  • If you want to write out and compile the loops shown here, you can pass int8 inputs and get an int64 result without overflow: https://stackoverflow.com/a/49011917/4045774 According to a short test I did, this also outperforms einsum with optimization enabled. But if you want real speed you have to use floats for input and output and go for tensordot. The reason for this is the BLAS backend, which only works with floats. (np.dot is also very slow on integers because the task isn't passed to the BLAS library) – max9111 Apr 11 '18 at 17:14
  • @hpaulj Advanced algorithms using chunking/blocking are extremely important for dot products (they reduce cache misses). This is actually quite an important reason why a dot product on large arrays is much faster with a BLAS backend than with a naive Fortran/C/Cython/Numba solution. https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm – max9111 Apr 11 '18 at 17:32
