I am trying to find the indices of the nonzero entries in each row of a sparse matrix (scipy.sparse.csc_matrix). So far, I am looping over each row of the matrix and applying

numpy.nonzero()

to each row to get the nonzero column indices. But this method would take over an hour to find the nonzero column entries per row. Is there a faster way to do this? Thanks!

user2498497

5 Answers

Use the .nonzero() method.

indices = sp_matrix.nonzero()

If you'd like the indices as (row, column) tuples, you can use zip (wrapped in list() so that it also yields a list on Python 3, where zip returns an iterator).

indices = list(zip(*sp_matrix.nonzero()))
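Since the question asks for column indices per row, the flat output of .nonzero() can be regrouped by row. The following is a minimal sketch, not a definitive implementation: the small matrix stands in for sp_matrix and its values are made up, and the explicit sort is there so the grouping does not depend on the order in which .nonzero() returns its entries.

```python
import numpy as np
from scipy import sparse

# Illustrative stand-in for sp_matrix (values are assumptions for this sketch).
sp_matrix = sparse.csc_matrix(np.array([[0, 0, 3],
                                        [4, 0, 5],
                                        [0, 0, 0]]))

rows, cols = sp_matrix.nonzero()

# Sort by row first, then by column (lexsort uses the last key as primary).
order = np.lexsort((cols, rows))
rows, cols = rows[order], cols[order]

# Count the nonzeros in each row and cut the column array at the row boundaries.
counts = np.bincount(rows, minlength=sp_matrix.shape[0])
per_row = np.split(cols, np.cumsum(counts)[:-1])

for i, c in enumerate(per_row):
    print(i, c.tolist())
```

Rows without nonzero entries come out as empty arrays, so per_row always has one entry per matrix row.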
Madison May

This is relatively straightforward for a CSR matrix, so you can always convert to CSR and split its indices array at the row boundaries:

>>> import numpy as np
>>> import scipy.sparse as sps
>>> a = sps.rand(5, 5, .2, format='csc')
>>> a.A
array([[ 0.        ,  0.        ,  0.68642384,  0.        ,  0.        ],
       [ 0.46120599,  0.        ,  0.83253467,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.07074811],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.21190832,  0.        ,  0.        ,  0.        ]])
>>> b = a.tocsr()
>>> np.split(b.indices, b.indptr[1:-1])
[array([2]), array([0, 2]), array([4]), array([], dtype=float64), array([1])]
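Because sps.rand produces a different matrix on every run, here is a deterministic sketch of the same idea. The matrix values are made up for illustration. The call to sort_indices() is an extra precaution so that the column indices within each row come out in ascending order. Note that b.indices also lists explicitly stored zeros, if the matrix has any; calling b.eliminate_zeros() first would drop them.

```python
import numpy as np
from scipy import sparse

# Illustrative 4x4 matrix (values are assumptions for this sketch).
dense = np.array([[0., 0., 7., 0.],
                  [1., 0., 2., 0.],
                  [0., 0., 0., 0.],
                  [0., 5., 0., 0.]])
a = sparse.csc_matrix(dense)

b = a.tocsr()
b.sort_indices()  # column indices within each row in ascending order

# b.indptr[i]:b.indptr[i + 1] delimits row i inside b.indices, so splitting
# at the interior pointers yields one array of column indices per row.
cols_per_row = np.split(b.indices, b.indptr[1:-1])

for i, cols in enumerate(cols_per_row):
    print(i, cols.tolist())
```

This avoids a Python-level loop over the matrix rows; the only per-row work is the split itself.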
Jaime

If you use coo_matrix this would be very easy, and the conversion between coo/csr/csc is very fast. Getting all the row and column indices separately can be done as follows:

sp_matrix = sp_matrix.tocoo()
row_ind = sp_matrix.row
col_ind = sp_matrix.col

You can also get both sets of indices at once for any of these sparse matrix types, which may be the easiest approach:

rows, cols = X.nonzero()

If you need to find values in a specific row, note that csc and csr matrices return the nonzero entries sorted by row, whereas coo returns its indices in the order they are stored (ordered by column in the example below).

In [1]: X = coo_matrix(([1, 2, 3, 4, 5, 6], ([0, 2, 2, 0, 1, 2], [0, 0, 1, 2, 2, 2])))

In [2]: X.todense()
Out[2]: 
matrix([[1, 0, 4],
        [0, 0, 5],
        [2, 3, 6]])

In [3]: X.nonzero()
Out[3]: 
(array([0, 2, 2, 0, 1, 2], dtype=int32),
 array([0, 0, 1, 2, 2, 2], dtype=int32))

In [4]: X.tocsr().nonzero()
Out[4]: 
(array([0, 0, 1, 2, 2, 2], dtype=int32),
 array([0, 2, 2, 0, 1, 2], dtype=int32))
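To pull out the columns of one specific row from the pair of arrays, a boolean mask on the row array is enough. This is a small sketch using the same X as above; columns_in_row is a hypothetical helper name, not part of scipy, and the sort makes the result independent of the storage order.

```python
import numpy as np
from scipy.sparse import coo_matrix

X = coo_matrix(([1, 2, 3, 4, 5, 6],
                ([0, 2, 2, 0, 1, 2], [0, 0, 1, 2, 2, 2])))
rows, cols = X.nonzero()

def columns_in_row(i):
    """Column indices of the nonzero entries in row i, in ascending order."""
    # The mask touches every stored entry, so each lookup costs O(nnz).
    return np.sort(cols[rows == i])

for i in range(X.shape[0]):
    print(i, columns_in_row(i).tolist())
```

For many lookups it is cheaper to group the indices once than to rebuild the mask for every row.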
Dennis Soemers
MarkAWard

I'm assuming that your matrix is not symmetric, otherwise finding all the non-zero entries in one row is the same as finding all those in one column. What are the dimensions of the matrix you're working with and how many non-zero entries are there per column on average?

If your matrix has m rows and n columns and you store it in the CSC format, you can return all the non-zero entries in a column in O(d) time, where d is the number of non-zero entries in the column, but there is no way to return all the non-zero entries in a row in less than O(n); you have to iterate over the entire row.
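The asymmetry can be seen directly in the CSC arrays. The sketch below uses a made-up 4x4 matrix: a column is a contiguous slice of indices, while the row lookup shown here (one possible way, not the only one) has to inspect every stored index and then map each hit back to its column.

```python
import numpy as np
from scipy import sparse

# Illustrative 4x4 matrix (values are assumptions for this sketch).
dense = np.array([[0., 0., 7., 0.],
                  [1., 0., 2., 0.],
                  [0., 0., 0., 0.],
                  [0., 5., 0., 0.]])
A = sparse.csc_matrix(dense)

# Column j: a direct slice of the stored row indices, cost O(d).
j = 2
rows_in_col_j = A.indices[A.indptr[j]:A.indptr[j + 1]]

# Row i: every stored index is compared against i.
i = 1
positions = np.flatnonzero(A.indices == i)
# Map each position back to the column whose indptr range contains it.
cols_in_row_i = np.searchsorted(A.indptr, positions, side="right") - 1

print(rows_in_col_j.tolist(), cols_in_row_i.tolist())
```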

I would make a copy of the matrix in CSR format and get rows from that instead of the original CSC matrix. You'll be using twice as much memory of course, so here's hoping that it's not so big as to preclude the extra overhead. That would look something like this:

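from scipy.sparse import csc_matrix, csr_matrix

m, n = 4, 5  # placeholder dimensions; use your own
A = csc_matrix((m, n), dtype=float)
# <fill A>
B = csr_matrix(A)  # row-oriented copy of A

for i in range(m):
    # Column indices of the nonzero entries in row i
    _, cols = B[i, :].nonzero()
    for j in cols:
        pass  # <do some stuff> with entry (i, j)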

I would not use the COO format, as getting all the non-zero entries in a given row can require O(nnz) time in the worst case, where nnz is the number of non-zero entries of the entire matrix.

If you find yourself having to do things with sparse matrices very often, you may want to have a look at this book. It describes many of the common and less common sparse matrix formats and illustrates some of the differences between them. There is no one best storage format; they all have their tradeoffs.

Daniel Shapero

What form do you want these indices in? For example

x=sparse.csr_matrix([[1,2,0,3,0,0],[0,0,0,1,0,0]])

In [15]: for r in x:
   ....:     print(r.nonzero())

(array([0]), array([0]))
(array([0, 0]), array([0, 2]))
(array([0, 0, 0]), array([0, 1, 2]))

In [30]: [r.nonzero()[1] for r in x]  # or as list
Out[30]: [array([0]), array([0, 2]), array([0, 1, 2])]

In [16]: x.nonzero()
Out[16]: (array([0, 1, 1, 2, 2, 2]), array([0, 0, 2, 0, 1, 2]))

Calling nonzero on the whole matrix returns the same numbers, but they aren't split into sublists. The lil format (via tolil) stores the same information as a list of lists.

In [18]: xl=x.tolil()
In [19]: xl.rows
Out[19]: array([[0], [0, 2], [0, 1, 2]], dtype=object)

In [23]: xc=x.tocoo()
In [24]: xc.row
Out[24]: array([0, 1, 2, 2, 1, 2])
In [25]: xc.col
Out[25]: array([0, 0, 0, 1, 2, 2])

In coo format the same indices are there, but the order is different. If you convert to csr first, the order is row by row:

In [29]: x.tocsr().tocoo().col
Out[29]: array([0, 0, 2, 0, 1, 2])
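As a self-contained sketch of the list-of-lists route: the 3x3 matrix below is an assumption, chosen only so that its nonzero pattern matches the outputs shown above. It compares the lil rows with the result of splitting the csr indices array at the row boundaries.

```python
import numpy as np
from scipy import sparse

# Assumed 3x3 matrix whose nonzero pattern matches the outputs above.
x = sparse.csr_matrix(np.array([[1, 0, 0],
                                [1, 0, 1],
                                [1, 1, 1]]))

# lil keeps one list of column indices per row.
lil_rows = [list(r) for r in x.tolil().rows]

# The same grouping taken straight from the csr arrays.
x.sort_indices()
split_rows = [cols.tolist() for cols in np.split(x.indices, x.indptr[1:-1])]

print(lil_rows)
print(split_rows)
```

Both give one list of column indices per row; the lil route builds Python lists, while the split route stays in numpy arrays until the final conversion.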
hpaulj