
I'm trying to convert the following MATLAB code to Python and am having trouble finding a solution that works in any reasonable amount of time.

M = diag(sum(a)) - a;
where = vertcat(in, out);
M(where,:) = 0;
M(where,where) = 1;

Here, a is a sparse matrix and where is a vector (as are in/out). The solution I have using Python is:

M = scipy.sparse.diags([degs], [0]) - A
where = numpy.hstack((inVs, outVs)).astype(int)
M = scipy.sparse.lil_matrix(M)
M[where, :] = 0  # This is the slowest line
M[where, where] = 1
M = scipy.sparse.csc_matrix(M)

But since A is 334863x334863, this takes like three minutes. If anyone has any suggestions on how to make this faster, please contribute them! For comparison, MATLAB does this same step imperceptibly fast.

Thanks!

Alex Reinking
  • Is `M` still sparse after that slow step? – Dennis Jaheruddin Nov 05 '13 at 14:40
  • 3
    @DennisJaheruddin yeah it is, but assignment forcefully adds all those `0`s, adding `len(where)*334863` elements to the matrix. they might be later pruned with `eliminate_zeros()`, but operation can't be considered effective. – alko Nov 05 '13 at 14:44
  • possible duplicate of [scipy.sparse : Set row to zeros](http://stackoverflow.com/questions/12129948/scipy-sparse-set-row-to-zeros) – askewchan Nov 05 '13 at 14:56
  • Oops, it did generate the same answer, didn't it? Apologies! That question didn't come up in any of my searches for fancy indexing in python or in my searches for matlab-like assignment for scipy sparse matrices... – Alex Reinking Nov 05 '13 at 17:14
  • @AlexReinking Yeah, and I mentioned it in my answer; the only difference is that I knew of its existence (I had saved a snippet with a reference and used it in a project, so I basically looked there when I found your question). And note that the `possible duplicate` flag arrived just after I linked this question to yours, i.e. when it appeared in the linked questions. ;) Renaming your question headline for better search usability would be appropriate. – alko Nov 05 '13 at 19:52
  • Sure thing! Do you have any suggestions for a better title? :) – Alex Reinking Nov 05 '13 at 19:54
  • 1
    Yes, I added the duplicate flag after you linked to it, @alko. Marking a question as having been answered elsewhere doesn't mean I think that Alex didn't do sufficient research! I actually do it because it benefits future visitors, who will find one common set of answers, regardless of whichever question comes up on a google or [SO] search. – askewchan Nov 06 '13 at 01:05
  • @Alex, I think the title is good. If you do want to change it, perhaps you could spell `sparse` with an `s` :P – askewchan Nov 06 '13 at 01:08
  • Done! And thanks for the vote of confidence askewchan :) – Alex Reinking Nov 06 '13 at 04:45
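
To illustrate the point raised in the comments about explicit zeros: `eliminate_zeros()` removes values that are stored in the sparsity pattern but equal to zero. A minimal sketch on a toy 3x3 matrix (chosen purely for illustration):

import numpy as np
import scipy.sparse as sps

A = sps.csr_matrix(np.arange(1, 10, dtype=float).reshape(3, 3))
A.data[A.indptr[1]:A.indptr[2]] = 0  # row 1's stored values become explicit zeros
print(A.nnz)         # 9: the zeros are still stored
A.eliminate_zeros()
print(A.nnz)         # 6: the explicit zeros are gone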

3 Answers


A slightly different take on alko's/seberg's approach. I find `for` loops disturbing, so I spent the better part of this morning figuring out a way to get rid of the loop. The following is not always faster than the other approach; it performs better the more rows there are to be zeroed and the sparser the matrix:

import numpy as np
import scipy.sparse as sps

def csr_zero_rows(csr, rows_to_zero):
    """Zero out the given rows of a CSR matrix in place, without a Python loop."""
    rows, cols = csr.shape
    mask = np.ones((rows,), dtype=bool)
    mask[rows_to_zero] = False
    nnz_per_row = np.diff(csr.indptr)

    # Expand the per-row mask to one flag per stored entry, then drop the
    # entries belonging to the zeroed rows and rebuild indptr.
    mask = np.repeat(mask, nnz_per_row)
    nnz_per_row[rows_to_zero] = 0
    csr.data = csr.data[mask]
    csr.indices = csr.indices[mask]
    csr.indptr[1:] = np.cumsum(nnz_per_row)
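
For reference, the trick above works directly on the CSR storage arrays; a toy 3x3 matrix (chosen here purely for illustration) shows what they look like:

import numpy as np
import scipy.sparse as sps

m = sps.csr_matrix(np.array([[1., 0., 2.],
                             [0., 0., 0.],
                             [3., 4., 0.]]))
print(m.data)             # [1. 2. 3. 4.]  stored values, row by row
print(m.indices)          # [0 2 0 1]      column index of each stored value
print(m.indptr)           # [0 2 2 4]      row i occupies data[indptr[i]:indptr[i+1]]
print(np.diff(m.indptr))  # [2 0 2]        stored entries per row, as used above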

And to test drive both approaches:

rows, cols = 334863, 334863
a = sps.rand(rows, cols, density=0.00001, format='csr')
b = a.copy()
rows_to_zero = np.random.choice(np.arange(rows), size=10000, replace=False)

In [117]: a
Out[117]: 
<334863x334863 sparse matrix of type '<type 'numpy.float64'>'
    with 1121332 stored elements in Compressed Sparse Row format>

In [118]: %timeit -n1 -r1 csr_rows_set_nz_to_val(a, rows_to_zero)
1 loops, best of 1: 75.8 ms per loop

In [119]: %timeit -n1 -r1 csr_zero_rows(b, rows_to_zero)
1 loops, best of 1: 32.5 ms per loop

And of course:

np.allclose(a.data, b.data)
Out[122]: True

np.allclose(a.indices, b.indices)
Out[123]: True

np.allclose(a.indptr, b.indptr)
Out[124]: True
Jaime

The solution I use for a similar task is due to @seberg and does not convert to the lil format:

import scipy.sparse
import numpy
import time

def csr_row_set_nz_to_val(csr, row, value=0):
    """Set all nonzero elements (elements currently in the sparsity pattern)
    to the given value. Useful to set to 0 mostly.
    """
    if not isinstance(csr, scipy.sparse.csr_matrix):
        raise ValueError('Matrix given must be of CSR format.')
    csr.data[csr.indptr[row]:csr.indptr[row+1]] = value

def csr_rows_set_nz_to_val(csr, rows, value=0):
    for row in rows:
        csr_row_set_nz_to_val(csr, row, value)
    if value == 0:
        csr.eliminate_zeros()

Wrap your evaluations with timing:

def evaluate(size):
    degs = [1] * size
    inVs = list(range(1, size, size // 25))
    outVs = list(range(5, size, size // 25))
    where = numpy.hstack((inVs, outVs)).astype(int)
    start_time = time.time()
    A = scipy.sparse.csc_matrix((size, size))
    M = scipy.sparse.diags([degs], [0]) - A
    csr_rows_set_nz_to_val(M, where)
    return time.time() - start_time

and test its performance:

>>> print('elapsed %.5f seconds' % evaluate(334863))
elapsed 0.53054 seconds
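
This only zeroes the stored values in those rows; to mirror the `M[where, where] = 1` step from the question, one option is to add a sparse matrix holding ones at those diagonal positions. A sketch (the helper name `set_diag_ones` is chosen here for illustration):

import numpy
import scipy.sparse

def set_diag_ones(M, where):
    # Add 1 at (i, i) for each i in `where`; since those rows were just zeroed,
    # the result at those positions is exactly 1.
    # (Duplicate indices in `where` would be summed.)
    ones = scipy.sparse.csr_matrix(
        (numpy.ones(len(where)), (where, where)), shape=M.shape)
    return M + ones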
alko

If you dislike digging around in the guts of the sparse matrices, you can also use sparse matrix multiplication with a diagonal matrix:

import numpy
from scipy import sparse

def zero_rows(M, rows_to_zero):
    # Build a diagonal matrix with 0 at the rows to zero and 1 everywhere else,
    # then multiply from the left to wipe out those rows.
    ixs = numpy.ones(M.shape[0], int)
    ixs[rows_to_zero] = 0
    D = sparse.diags(ixs)
    res = D * M
    return res
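
For example, applied to a small random matrix (a usage sketch; the 5x5 size and the rows picked here are arbitrary):

M = sparse.rand(5, 5, density=0.5, format='csr')
M2 = zero_rows(M, [0, 2])
print(M2[[0, 2], :].toarray())  # both rows come out as all zeros

Unlike the in-place approaches above, this returns a new matrix and leaves `M` untouched.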
Klapaucius