I found it to be somewhat more efficient to use csc_matrix.multiply:
In [77]: %timeit (x.transpose() * y)[0,0]
98.6 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [78]: %timeit x.T.dot(y)[0,0]
133 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [79]: %timeit x.multiply(y).sum(0)[0, 0]
77.7 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [80]: %timeit x.multiply(y).sum()
86.1 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
It might also be worth pointing out that csc_array behaves rather differently from csc_matrix when it comes to these operations: the shapes of the results come out differently, the x.transpose() * y solution no longer works (for sparse arrays, * means elementwise multiplication rather than matrix multiplication), but np.dot works off the shelf:
In [85]: x2 = csc_array(x)
In [87]: y2 = csc_array(y)
In [108]: %timeit np.dot(x2, y2).sum()
70.9 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [115]: %timeit x2.T.dot(y2)[0, 0]
103 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [117]: %timeit x2.multiply(y2).sum(0)[0]
70.5 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [120]: %timeit x2.multiply(y2).sum()
59.6 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
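To make the shape differences concrete, here is a small illustrative sketch (the vectors below are made up for demonstration): csc_matrix reductions come back as (1, 1) np.matrix objects, while csc_array reductions come back as 1-D arrays, which is why the indexing above changes from [0, 0] to [0]:

```python
import numpy as np
from scipy.sparse import csc_matrix, csc_array

# Small illustrative column vectors (the original x and y are not shown).
x = csc_matrix(np.array([[1.0], [0.0], [2.0]]))
y = csc_matrix(np.array([[3.0], [4.0], [5.0]]))
x2, y2 = csc_array(x), csc_array(y)

dense_dot = float(x.toarray().ravel() @ y.toarray().ravel())  # 1*3 + 0*4 + 2*5

# multiply() is elementwise for both containers, but the reductions
# come out with different shapes/types:
m = x.multiply(y).sum(0)    # np.matrix of shape (1, 1) -> index with [0, 0]
a = x2.multiply(y2).sum(0)  # 1-D ndarray of length 1   -> index with [0]
assert m.shape == (1, 1) and m[0, 0] == dense_dot
assert a.shape == (1,) and a[0] == dense_dot
```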
And if you end up with rows instead of columns -- i.e. csr_array instead of csc_array -- then the fastest solution above is no longer the fastest (and is in fact extremely slow for very large vectors)!
In [123]: x3 = csr_array(x.T)
In [124]: y3 = csr_array(y.T)
In [130]: %timeit np.dot(x3, y3).sum()
72.2 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [134]: %timeit x3.dot(y3.T)[0, 0]
92.6 µs ± 358 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [139]: %timeit x3.multiply(y3).sum(1)[0]
62.7 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [140]: %timeit x3.multiply(y3).sum()
76.4 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
An example with a very large (but very sparse) vector:
In [142]: a = csr_array(([1, 2], ([0, 0], [0, 999_999_999])), shape=(1, 1_000_000_000))
In [147]: %time np.dot(a, a).sum()
CPU times: user 205 ms, sys: 926 ms, total: 1.13 s
Wall time: 1.13 s
Out[147]: 5
In [149]: %time a.dot(a.T)[0, 0]
CPU times: user 854 ms, sys: 513 ms, total: 1.37 s
Wall time: 1.37 s
Out[149]: 5
In [151]: %time a.multiply(a).sum(1)[0]
CPU times: user 181 µs, sys: 0 ns, total: 181 µs
Wall time: 182 µs
Out[151]: 5
In [152]: %time a.multiply(a).sum()
CPU times: user 229 ms, sys: 902 ms, total: 1.13 s
Wall time: 1.13 s
Out[152]: 5
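For completeness, a runnable sketch of this construction (note the COO-style (data, (rows, cols)) constructor; the nonzero column positions below are arbitrary choices for illustration -- only the values 1 and 2 determine the result):

```python
import numpy as np
from scipy.sparse import csr_array

# A (1, n) vector with two nonzero values, 1 and 2; the column positions
# (0 and n-1) are arbitrary illustrative choices.
n = 1_000_000_000
rows = np.array([0, 0])
cols = np.array([0, n - 1])
a = csr_array((np.array([1.0, 2.0]), (rows, cols)), shape=(1, n))

# The row-wise reduction stays cheap even for huge n:
d = a.multiply(a).sum(1)[0]
assert d == 5.0  # 1*1 + 2*2
```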
If you can use Cython or Numba, it's possible to do this even faster. For Cython, an implementation that calculates the dot product of two shape-(1, n) csr_arrays with float64 values, and which gives me a 10%-20% performance improvement over a.multiply(b).sum(1)[0], could look something like the following:
cimport cython


@cython.boundscheck(False)
@cython.wraparound(False)
def cython_dot(a, b):
    # Dot product of two (1, n) float64 csr_arrays: walk both sorted
    # column-index arrays in lockstep, multiplying values where they match.
    cdef int na = a.nnz
    cdef int nb = b.nnz
    cdef double[:] a_data_view = a.data
    cdef int[:] a_indices_view = a.indices
    cdef double[:] b_data_view = b.data
    cdef int[:] b_indices_view = b.indices
    cdef int a_ptr = -1
    cdef int b_ptr = -1
    cdef int aiv, biv
    cdef double result = 0
    cdef bint a_update = True
    cdef bint b_update = True
    while True:
        if a_update:
            a_ptr += 1
            if a_ptr == na:  # check bounds before reading (boundscheck is off)
                break
            aiv = a_indices_view[a_ptr]
            a_update = False
        if b_update:
            b_ptr += 1
            if b_ptr == nb:
                break
            biv = b_indices_view[b_ptr]
            b_update = False
        if aiv < biv:    # advance whichever vector's column index lags behind
            a_update = True
        elif aiv > biv:
            b_update = True
        else:            # matching column: accumulate the product
            result += a_data_view[a_ptr] * b_data_view[b_ptr]
            a_update = True
            b_update = True
    return result
Here, result += a_data_view[a_ptr] * b_data_view[b_ptr] could also use libc.math.fma if higher precision is necessary (but this makes it a bit slower in my testing):
result = fma(a_data_view[a_ptr], b_data_view[b_ptr], result)
A fast Numba implementation could be similar and is about as fast in my testing.