11

Given a Scipy CSC sparse matrix "sm" with dimensions (170k x 170k) and 440 million non-zero entries, and a sparse CSC vector "v" (170k x 1) with only a few non-zero entries, is there anything that can be done to improve the performance of the operation:

resul = sm.dot(v)

?

Currently it's taking roughly 1 second. Initializing the matrices as CSR increased the time to roughly 3 seconds, so CSC performed better.
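To give an idea of the comparison, here is a minimal sketch of how one could time CSC vs CSR products at a much smaller scale; the sizes and densities below are made up for illustration and are not the ones from my real data:

```python
import scipy.sparse as sps
import timeit

# Smaller stand-in for the real setup (sizes/densities are illustrative only)
n = 5000
sm_csc = sps.rand(n, n, density=0.01, format='csc', random_state=0)
sm_csr = sm_csc.tocsr()                                   # same matrix, row layout
v = sps.rand(n, 1, density=0.001, format='csc', random_state=1)  # a few non-zeros

t_csc = timeit.timeit(lambda: sm_csc.dot(v), number=10)
t_csr = timeit.timeit(lambda: sm_csr.dot(v), number=10)
print('CSC: %.4fs  CSR: %.4fs' % (t_csc, t_csr))
```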

sm is a matrix of similarities between products, and v is the vector that represents which products the user bought or clicked on. So sm is the same for every user.

I'm using Ubuntu 13.04, Intel i3 @3.4GHz, 4 Cores.

Researching on SO, I read about the BLAS package. I typed into the terminal:

~$ ldd /usr/lib/python2.7/dist-packages/numpy/core/_dotblas.so

Which resulted in:

    linux-vdso.so.1 =>  (0x00007fff56a88000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0x00007f888137f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8880fb7000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8880cb1000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f888183c000)

From what I understood, this means that I'm already using a high-performance BLAS package. I'm still not sure, though, whether this package implements parallel computing, but it looks like it doesn't.

Could multi-core processing help to boost performance? If so, is there any library that could be helpful in python?

I was also considering the idea of implementing this in Cython but I don't know if this would lead to good results.

Thanks in advance.

Willian Fuks
  • How long is the process taking to run at present? Details of some parallel processing modules are covered here: https://wiki.python.org/moin/ParallelProcessing – ChrisProsser Sep 03 '13 at 15:37
  • @ChrisProsser It depends. If the user bought many items then it can take up to 15s, but on average it's taking 1.34s just in the multiplicative operation. Thanks for the link, I'm already reading it =) – Willian Fuks Sep 03 '13 at 16:16
  • While np.dot uses BLAS, and various parallel and multicore enhancements, the sparse dot might not. The dense version of sm might be too large, but it might be instructive to compare the sparse and dense speeds with a smaller matrix. – hpaulj Sep 03 '13 at 16:33
  • @Will I would be very interested to hear what performance gain you manage to get if you try either Cython or a parallel processing module. – ChrisProsser Sep 04 '13 at 11:24
  • @ChrisProsser I still plan to use Cython to solve this because I want to learn more about Cython. One thing though is that Jaime's solution dropped the time consumed to 6ms so I'm not sure if Cython would improve much more. I'll give it a try though and let you know =) – Willian Fuks Sep 04 '13 at 15:35
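hpaulj's suggestion above, comparing sparse and dense speeds with a smaller matrix, might be sketched like this (all sizes are illustrative, chosen so the dense copy fits in memory):

```python
import numpy as np
import scipy.sparse as sps
import timeit

# Compare sparse vs dense dot at a size where a dense copy is feasible
n = 2000
sparse_m = sps.rand(n, n, density=0.015, format='csc', random_state=0)
dense_m = sparse_m.toarray()           # dense copy of the same matrix

v = np.zeros((n, 1))
v[::200] = 1.0                         # mostly-zero dense vector
sv = sps.csc_matrix(v)                 # sparse version of the same vector

t_sparse = timeit.timeit(lambda: sparse_m.dot(sv), number=20)
t_dense = timeit.timeit(lambda: dense_m.dot(v), number=20)
print('sparse: %.4fs  dense: %.4fs' % (t_sparse, t_dense))
```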

2 Answers

13

The sparse matrix multiplication routines are directly coded in C++, and as far as a quick look at the source reveals, there doesn't seem to be any hook to an optimized library. Furthermore, it doesn't take advantage of the fact that the second matrix is a vector to minimize calculations. So you can probably speed things up quite a bit by accessing the guts of the sparse matrix and customizing the multiplication algorithm. The following code does so in pure Python/Numpy, and if the vector really has "a few non-null points" it matches the speed of scipy's C++ code. If you implemented it in Cython, the speed increase should be noticeable:

import numpy as np
import scipy.sparse as sps

def sparse_col_vec_dot(csc_mat, csc_vec):
    # row numbers of vector non-zero entries
    v_rows = csc_vec.indices
    v_data = csc_vec.data
    # matrix description arrays
    m_dat = csc_mat.data
    m_ind = csc_mat.indices
    m_ptr = csc_mat.indptr
    # output arrays: non-zeros contributed by each selected matrix column
    sizes = m_ptr.take(v_rows+1) - m_ptr.take(v_rows)
    sizes = np.concatenate(([0], np.cumsum(sizes)))
    data = np.empty((sizes[-1],), dtype=csc_mat.dtype)
    indices = np.empty((sizes[-1],), dtype=np.intp)
    indptr = np.zeros((2,), dtype=np.intp)

    for j in range(len(sizes)-1):
        # slice out column v_rows[j] of the matrix and scale it by v_data[j]
        slice_ = slice(*m_ptr[[v_rows[j], v_rows[j]+1]])
        np.multiply(m_dat[slice_], v_data[j], out=data[sizes[j]:sizes[j+1]])
        indices[sizes[j]:sizes[j+1]] = m_ind[slice_]
    indptr[-1] = len(data)
    # entries coming from different columns may share a row index; sum them
    ret = sps.csc_matrix((data, indices, indptr),
                         shape=csc_vec.shape)
    ret.sum_duplicates()

    return ret

A quick explanation of what is going on: a CSC matrix is defined in three linear arrays:

  • data contains the non-zero entries, stored in column major order.
  • indices contains the rows of the non-zero entries.
  • indptr has one entry more than the number of columns of the matrix, and items in column j are found in data[indptr[j]:indptr[j+1]] and are in rows indices[indptr[j]:indptr[j+1]].

So to multiply by a sparse column vector, you can iterate over data and indices of the column vector, and for each (d, r) pair, extract the corresponding column of the matrix and multiply it by d, i.e. data[indptr[r]:indptr[r+1]] * d and indices[indptr[r]:indptr[r+1]].
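To make the three arrays concrete, here is a tiny example (matrix values chosen arbitrarily):

```python
import numpy as np
import scipy.sparse as sps

# A small matrix to expose the CSC internals
M = sps.csc_matrix(np.array([[1., 0., 2.],
                             [0., 3., 0.],
                             [4., 0., 5.]]))
print(M.data)     # [1. 4. 3. 2. 5.]  -- non-zeros in column-major order
print(M.indices)  # [0 2 1 0 2]       -- row of each non-zero
print(M.indptr)   # [0 2 3 5]         -- column j lives in data[indptr[j]:indptr[j+1]]

# Multiplying by a column vector whose only non-zero is d in row r
# amounts to extracting column r of the matrix, scaled by d:
r, d = 2, 10.0
col_data = M.data[M.indptr[r]:M.indptr[r+1]] * d   # [20. 50.]
col_rows = M.indices[M.indptr[r]:M.indptr[r+1]]    # [0 2]
```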

Jaime
  • This solution was beautiful! I don't quite understand why it's faster than the "dot" operation but it for sure worked very well. The time dropped from 1s to roughly 6ms...! Sorting the recommendation list became the new bottleneck now lol! Thanks a lot Jaime, wonderful solution! – Willian Fuks Sep 04 '13 at 15:32
  • You got that speed increase with the code as is, without rewriting any stuff in Cython? That's surprising, and not what I saw on my tests. Either your matrix and vector are more sparse than your question says, or you may want to double check that there is no typo, and you are getting the same result as from `.dot`. – Jaime Sep 04 '13 at 15:40
  • I double checked the result and it's correct. If you didn't observe the same time drop I wonder if I had something wrong in my previous code...now it's working pretty well though. – Willian Fuks Sep 04 '13 at 16:13
  • @jaime A wonderful explanation. In general, a Cython implementation won't make much of a difference here because numpy and scipy are calling C/C++ code. Ditto with numpy/scipy vectorized code. Haven't tested it but maybe the cython.parallel module which is based on OpenMP may help. – Henry Thornton Dec 13 '13 at 20:38
  • @dbv Well, there's the looping over the columns: if you had the three arrays in C/C++/Cython, you could do everything directly, without having the Python overhead of building slices and using them to extract views of the arrays. – Jaime Dec 14 '13 at 01:40
  • Using the CSR format, test results with and w/out cython didn't show a significant performance difference. Ah, wait a minute - re-reading your post. Are you suggesting cython'izing the above python code instead of using the scipy sparse multiplication? – Henry Thornton Dec 14 '13 at 09:44
  • @dbv Yes, exactly, last time I looked scipy.sparse wasn't handling some special cases of multiplication differently from a general algorithm, so sometimes you can even beat the default C++ implementation recoding the algorithm in Python/Numpy. If you did the coding in C/C++/Cython you should get a very significant improvement. – Jaime Dec 14 '13 at 17:34
  • But, an optimized library (eg. Atlas, OpenBlas, MKL) would close the performance gap? Agree that scipy.sparse doesn't have options for special cases. – Henry Thornton Dec 14 '13 at 19:46
  • I ended up building a 1D-only C/Cython version of something like this, without slicing and fully parallelized (python package called "sparse_dot" https://github.com/pluralsight/sparse_dot). Its main goal was to compute cosine distance between each pair of a large number of sparse arrays. – David Nov 08 '16 at 21:48
4

Recently I had the same issue. I solved it like this:

def sparse_col_vec_dot(csc_mat, csc_vec):
    # convert the matrix to CSR so its rows pair up with the column vector
    curr_mat = csc_mat.tocsr()
    return curr_mat * csc_vec

The trick here is that we keep one operand in row representation (CSR) and the other in column representation (CSC).
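A small sanity check on random data (sizes are illustrative) shows that the CSR-times-CSC product matches the all-CSC one:

```python
import numpy as np
import scipy.sparse as sps

# Random stand-ins for the question's sm and v, at toy scale
sm = sps.rand(100, 100, density=0.05, format='csc', random_state=0)
v = sps.rand(100, 1, density=0.02, format='csc', random_state=1)

resul_csc = sm.dot(v)            # all-CSC product
resul_csr = sm.tocsr().dot(v)    # CSR matrix times CSC vector
print(np.allclose(resul_csc.toarray(), resul_csr.toarray()))  # True
```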

Vaali