
It is my understanding that the outer product of a vector with its own transpose is a symmetric matrix.

Does NumPy take this into account and only do the multiplications for the upper-triangular part of the output, or does it calculate the whole output matrix (even though it is symmetric, so time and memory would go to waste)?
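
For example (any small vector shows the symmetry; the example below is arbitrary):

import numpy as np

a = np.arange(5.0)             # any 1-D vector
M = np.outer(a, a)             # outer product of a with itself
print(np.array_equal(M, M.T))  # True: the result equals its own transpose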

Cypher
  • What do you call outer product? To my knowledge, the outer product of two vectors is not symmetric. – Kefeng91 Jul 20 '18 at 10:53
  • Thanks, I edited the question. I meant the outer product of a vector with its own transpose: A*A.T – Cypher Jul 20 '18 at 11:05
  • What's a `vector`? That's not formally defined in numpy. Give an example of the calculation. – hpaulj Jul 20 '18 at 12:40
  • Do some time tests. A `dot` may detect a transpose, say with a (n,1) and (1,n) – hpaulj Jul 20 '18 at 13:36
  • @hpaulj according to this link the dot product already does this but not sure about the outer product: https://stackoverflow.com/questions/43453707/numpy-dot-too-clever-about-symmetric-multiplications – Cypher Jul 20 '18 at 13:53
  • @hpaulj a vector is just a 1 dimensional array. A = Mx1 array so AxA.T = MxM matrix. – Cypher Jul 20 '18 at 13:55
  • (m,) is 1d, (m,1) is 2d. Which outer do you mean: `x.T*x` or `outer(x,x)`? Does that link give any hints as to how `dot` detects this issue? – hpaulj Jul 20 '18 at 14:43

1 Answer


Exploring some alternatives:

In [162]: x=np.arange(100)
In [163]: np.outer(x,x)
Out[163]: 
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
In [164]: x1=x[:,None]
In [165]: x1*x1.T
Out[165]: 
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
In [166]: np.dot(x1,x1.T)
Out[166]: 
array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
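
For what it's worth, a quick check (continuing with the same `x` and `x1`) confirms the three expressions produce identical arrays:

print(np.array_equal(np.outer(x, x), x1 * x1.T))          # True
print(np.array_equal(np.outer(x, x), np.dot(x1, x1.T)))   # True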

Comparing their times:

In [167]: timeit np.outer(x,x)
40.8 µs ± 63.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [168]: timeit x1*x1.T
36.3 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [169]: timeit np.dot(x1,x1.T)
60.7 µs ± 6.86 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Is `dot` using a transpose shortcut? I don't think so, or if it does, it doesn't help in this case. I'm a little surprised that `dot` is slower.

In [170]: x2=x1.T
In [171]: timeit np.dot(x1,x2)
61.1 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
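
To rule out any view-detection effect entirely, one could also time against a contiguous copy of the transpose (the name `x3` is just for this illustration):

import timeit
x3 = x1.T.copy()   # same values as x1.T, but a separate buffer
print(timeit.timeit(lambda: np.dot(x1, x3), number=10000))   # compare with In [171]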

Another method:

In [172]: timeit np.einsum('i,j',x,x)
28.3 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

einsum with x1 and x2 has the same times.

Interesting that `matmul` does as well as `einsum` in this case (maybe `einsum` is delegating to `matmul`?):

In [178]: timeit x1@x2
27.3 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [179]: timeit x1@x1.T
27.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
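
As for the guess that `einsum` delegates to `matmul`, one indirect probe is `np.einsum_path`, which reports the contraction einsum plans; it does not reveal which low-level routine ultimately runs, so this is only suggestive:

path, info = np.einsum_path('i,j', x, x)
print(info)   # a single contraction combining the two operands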

Numpy efficient matrix self-multiplication (gram matrix) demonstrates how `dot` can save time by being clever (for a 1000x1000 array).

As discussed in the links, `dot` can detect when one argument is the transpose of the other (probably by checking the data buffer pointer, shape, and strides), and can use a BLAS function optimized for symmetric calculations. But I don't see evidence of `outer` doing that. And it's unlikely that broadcasted multiplication would take such a step.
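
A minimal sketch of how one could probe that from the outside (the 1000x1000 size and repeat counts are arbitrary; the actual detection, if any, happens inside `dot`'s dispatch to BLAS):

import numpy as np
import timeit

A = np.random.rand(1000, 1000)
B = A.T.copy()                    # same values as A.T, but its own contiguous buffer

print(np.shares_memory(A, A.T))   # True:  the transpose is a view on A's data
print(np.shares_memory(A, B))     # False: the copy is not
print(A.strides, A.T.strides)     # the view simply has reversed strides

# If dot recognizes A.T as a view of A and calls a symmetric BLAS routine,
# the first timing should come out noticeably lower than the second:
print(timeit.timeit(lambda: A.dot(A.T), number=10))
print(timeit.timeit(lambda: A.dot(B), number=10))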

hpaulj
  • Thanks so much for the detailed analysis... So far it seems that `matmul` and `einsum` are faster than the other alternatives... I wonder if they stay fast when we scale the data to millions of 32bit floats or not... – Cypher Jul 21 '18 at 05:13