
I have two 3D tensors: tensor A with shape [B,N,S] and tensor B, also with shape [B,N,S]. What I want to get is a third tensor C, which I expect to have shape [B,B,N], where the element C[i,j,k] = np.dot(A[i,k,:], B[j,k,:]). I also want to achieve this in a vectorized way.

Some further info: The two tensors A and B have shape [Batch_size, Num_vectors, Vector_size]. The tensor C is supposed to represent the dot product between every element in the batch from A and every element in the batch from B, across all of the different vectors.
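
To make the definition concrete, here is a plain-loop version of what I mean (just for illustration; this is exactly the computation I want to vectorize, and the function name is only mine):

import numpy as np

def dot_batches_loop(A, B):
    # A, B: [Batch_size, Num_vectors, Vector_size]
    batch, num_vec, _ = A.shape
    C = np.empty((batch, batch, num_vec))
    for i in range(batch):
        for j in range(batch):
            for k in range(num_vec):
                C[i, j, k] = np.dot(A[i, k, :], B[j, k, :])
    return C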

I hope that is clear enough, and I'm looking forward to your answers!

gorjan

3 Answers

4
In [331]: A=np.random.rand(100,200,300)                                                              
In [332]: B=A

The suggested einsum, working directly from the

C[i,j,k] = np.dot(A[i,k,:], B[j,k,:])

expression:

In [333]: np.einsum( 'ikm, jkm-> ijk', A, B).shape                                                   
Out[333]: (100, 100, 200)
In [334]: timeit np.einsum( 'ikm, jkm-> ijk', A, B).shape                                            
800 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

matmul does a dot product on the last 2 dimensions and treats the leading one(s) as batch dimensions. In your case 'k' is the batch dimension, and 'm' is the one that has to obey the "last axis of A, second-to-last axis of B" rule. So, rewriting the ikm,jkm->ijk subscripts to fit, and transposing A and B accordingly:

In [335]: np.einsum('kim,kmj->kij', A.transpose(1,0,2), B.transpose(1,2,0)).shape                     
Out[335]: (200, 100, 100)
In [336]: timeit np.einsum('kim,kmj->kij',A.transpose(1,0,2), B.transpose(1,2,0)).shape              
774 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not much difference in performance. But now use matmul:

In [337]: (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape                             
Out[337]: (100, 100, 200)
In [338]: timeit (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape                      
64.4 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

and verify that the values match (though more often than not, if the shapes match, the values do too).

In [339]: np.allclose((A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0), np.einsum('ikm, jkm-> ijk', A, B))
Out[339]: True

I won't try to measure memory usage, but the time improvement suggests it too is better.

In some cases einsum is optimized to use matmul. Here that doesn't seem to be the case, though we could play with its optimize parameter. I'm a little surprised that matmul is doing so much better.
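
For example, a quick sketch of playing with the optimize parameter (what path einsum reports, and whether it helps, will depend on the numpy version and the array sizes):

# ask einsum to search for a cheaper contraction path; it may dispatch to BLAS internally
path, path_info = np.einsum_path('ikm,jkm->ijk', A, B, optimize='optimal')
print(path_info)                                  # shows the chosen contraction order
C_opt = np.einsum('ikm,jkm->ijk', A, B, optimize=True)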

===

I vaguely recall another SO question about matmul taking a shortcut when the two arrays are the same thing, A@A. I used B=A in these tests.

In [350]: timeit (A.transpose(1,0,2)@B.transpose(1,2,0)).transpose(1,2,0).shape                      
60.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [352]: B2=np.random.rand(100,200,300)                                                             
In [353]: timeit (A.transpose(1,0,2)@B2.transpose(1,2,0)).transpose(1,2,0).shape                     
97.4 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But that only made a modest difference.

In [356]: np.__version__                                                                             
Out[356]: '1.16.4'

My BLAS etc is standard Linux, nothing special.

hpaulj
  • Nice answer. I get similar timings with `einsum` and `@` (np ver 1.15.3). Is your improvement because you are using np>1.16.0? – Brenlla Jun 27 '19 at 10:11
  • I updated `numpy` and now I get similar timings. Probably because of [this](https://docs.scipy.org/doc/numpy/release.html#matmul-is-now-a-ufunc) – Brenlla Jun 27 '19 at 13:49
  • I mean similar timings to the ones in the answer – Brenlla Jun 27 '19 at 14:04
  • My numpy is 1.16.4. I also used `B=A`, and matmul takes a modest shortcut in the `A@A` case, but that doesn't account for most of the time difference. – hpaulj Jun 27 '19 at 15:51
  • Yes, I got timings like the one in your answer, after updating numpy – Brenlla Jun 27 '19 at 16:03
2

I think you can use einsum, like this:

np.einsum( 'ikm, jkm-> ijk', A, B)

With the subscripts 'ikm, jkm-> ijk', you specify which dimensions are reduced according to the Einstein summation convention. The third dimension of both arrays A and B, here named 'm', is the one that gets reduced, just as the dot operation does on vectors.
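
A quick sanity check against the element-wise definition in the question, with small arbitrary sizes (just an illustration, not part of any timing comparison):

import numpy as np

A = np.random.rand(4, 5, 6)   # [B, N, S]
B = np.random.rand(4, 5, 6)

C = np.einsum('ikm,jkm->ijk', A, B)              # shape (4, 4, 5) = [B, B, N]
# spot-check one entry against C[i,j,k] = np.dot(A[i,k,:], B[j,k,:])
assert np.isclose(C[1, 2, 3], np.dot(A[1, 3, :], B[2, 3, :]))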

Ben.T
  • Thanks! Can you translate that into matrix multiplication ops? – gorjan Jun 26 '19 at 13:57
  • @gorjan Not sure I understand what you want me to do? – Ben.T Jun 26 '19 at 14:01
  • Instead of going with something like `einsum` I am wondering whether your solution can be written as one/several matrix multiplications (I assume it can). From what I know, `einsum` is just syntactic sugar that wraps one or more invocations of `matmul`. – gorjan Jun 26 '19 at 14:05
  • @gorjan I see, and maybe there is an easy solution, but unfortunately I don't seem to be able to find a way that avoids the extra calculation; it would end up doing something like the `tensordot` solution of *Learning is a mess* – Ben.T Jun 26 '19 at 14:49
  • Transpose the arrays so `kim,kmj->kij`, then use `@` – hpaulj Jun 26 '19 at 15:18
  • Hey @hpaulj, can you post an answer as well? Even though the accepted answer works just fine, it creates a huge tensor under the hood to perform the tensor reduction, which makes it infeasible to use in my case. – gorjan Jun 26 '19 at 22:16
-1

Try:

C = np.diagonal( np.tensordot(A,B, axes=(2,2)), axis1=1, axis2=3)

from https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html#numpy.tensordot

Explanation

The solution is a composition of two operations. First comes the tensor product between A and B over their third axes, as you want. This outputs a rank-4 tensor, which you then reduce to a rank-3 tensor by taking equal indices on axes 1 and 3 (your k in your notation; note that tensordot gives a different axis order than your maths). This can be done by taking the diagonal, just as you would when reducing a matrix to the vector of its diagonal entries.
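
A small sketch of the two steps, with arbitrary sizes, to make the intermediate rank-4 tensor and the diagonal step visible:

import numpy as np

A = np.random.rand(4, 5, 6)   # [B, N, S]
B = np.random.rand(4, 5, 6)

full = np.tensordot(A, B, axes=(2, 2))           # rank-4 result, shape (4, 5, 4, 5)
C = np.diagonal(full, axis1=1, axis2=3)          # keep only matching k: shape (4, 4, 5) = [B, B, N]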

Learning is a mess
  • Please try again, first version had wrong axis indices. – Learning is a mess Jun 26 '19 at 13:50
  • It would be useful if you could explain a bit. One-liners are not much help, even if they work. – Jovan Andonov Jun 26 '19 at 13:54
  • This method, as it is now, is very inefficient compared to the above `einsum`. You are creating an array `N` times larger than `C` and then taking the diagonal. It is at least 10x slower than `einsum` for small arrays and potentially orders of magnitude slower for large arrays. – Brenlla Jun 26 '19 at 14:10
  • @Brenlla Correct, I just benchmarked them and mine is 10x slower, plus a larger temporary memory footprint I guess. I do not understand the downvote though... – Learning is a mess Jun 26 '19 at 14:37