5

I wanted to repeat the rows of a scipy csr sparse matrix, but when I tried to call numpy's repeat method, it simply treats the sparse matrix like an object, and would only repeat it as an object in an ndarray. I looked through the documentation, but I couldn't find any utility to repeats the rows of a scipy csr sparse matrix.

I wrote the following code that operates on the internal data, which seems to work

def csr_repeat(csr, repeats):
    if isinstance(repeats, int):
        repeats = np.repeat(repeats, csr.shape[0])
    repeats = np.asarray(repeats)
    rnnz = np.diff(csr.indptr)
    ndata = rnnz.dot(repeats)
    if ndata == 0:
        return sparse.csr_matrix((np.sum(repeats), csr.shape[1]),
                                 dtype=csr.dtype)
    indmap = np.ones(ndata, dtype=np.int)
    indmap[0] = 0
    rnnz_ = np.repeat(rnnz, repeats)
    indptr_ = rnnz_.cumsum()
    mask = indptr_ < ndata
    indmap -= np.int_(np.bincount(indptr_[mask],
                                  weights=rnnz_[mask],
                                  minlength=ndata))
    jumps = (rnnz * repeats).cumsum()
    mask = jumps < ndata
    indmap += np.int_(np.bincount(jumps[mask],
                                  weights=rnnz[mask],
                                  minlength=ndata))
    indmap = indmap.cumsum()
    return sparse.csr_matrix((csr.data[indmap],
                              csr.indices[indmap],
                              np.r_[0, indptr_]),
                             shape=(np.sum(repeats), csr.shape[1]))

and be reasonably efficient, but I'd rather not monkey patch the class. Is there a better way to do this?

Edit

As I revisit this question, I wonder why I posted it in the first place. Almost everything I could think to do with the repeated matrix would be easier to do with the original matrix, and then apply the repetition afterwards. My assumption is that post repetition will always be the better way to approach this problem than any of the potential answers.

Erik
  • 6,470
  • 5
  • 36
  • 37

4 Answers4

5
from scipy.sparse import csr_matrix
repeated_row_matrix = csr_matrix(np.ones([repeat_number,1])) * sparse_row
user3357359
  • 139
  • 1
  • 4
  • 3
    Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others. Thanks for improving the answer's reference value and making it more understandable! – Tim Diekmann Jun 08 '18 at 11:48
  • this answer only works when the sparse matrix to be repeated is actually a sparse vector. just using basic linear algebra... – user3357359 Jun 11 '18 at 11:57
  • @user3357359 if you're repeating a sparse vector, it's seems much quicker to just do something like `sparse_row[np.zeros(repeat_number),:]` – C.J. Jackson Nov 18 '19 at 05:54
2

It's not surprising that np.repeat does not work. It delegates the action to the hardcoded a.repeat method, and failing that, first turns a into an array (object if needed).

In the linear algebra world where sparse code was developed, most of the assembly work was done on the row, col, data arrays BEFORE creating the sparse matrix. The focus was on efficient math operations, and not so much on adding/deleting/indexing rows and elements.

I haven't worked through your code, but I'm not surprised that a csr format matrix requires that much work.

I worked out a similar function for the lil format (working from lil.copy):

def lil_repeat(S, repeat):
    # row repeat for lil sparse matrix
    # test for lil type and/or convert
    shape=list(S.shape)
    if isinstance(repeat, int):
        shape[0]=shape[0]*repeat
    else:
        shape[0]=sum(repeat)
    shape = tuple(shape)
    new = sparse.lil_matrix(shape, dtype=S.dtype)
    new.data = S.data.repeat(repeat) # flat repeat
    new.rows = S.rows.repeat(repeat)
    return new

But it is also possible to repeat using indices. Both lil and csr support indexing that is close to that of regular numpy arrays (at least in new enough versions). Thus:

S = sparse.lil_matrix([[0,1,2],[0,0,0],[1,0,0]])
print S.A.repeat([1,2,3], axis=0)
print S.A[(0,1,1,2,2,2),:]
print lil_repeat(S,[1,2,3]).A
print S[(0,1,1,2,2,2),:].A

give the same result

and best of all?

print S[np.arange(3).repeat([1,2,3]),:].A
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • 2
    `S[np.arange(3).repeat([1,2,3]),:]` is genius, and in some quick testing was also much faster than my method. – Erik Aug 19 '14 at 00:56
1

After someone posted a really clever response for how best to do this I revisited my original question, to see if there was an even better way. I I came up with one more way that has some pros and cons. Instead of repeating all of the data (as is done with the accepted answer), we can instead instruct scipy to reuse the data of the repeated rows, creating something akin to a view of the original sparse array (as you might do with broadcast_to). This can be done by simply tiling the indptr field.

repeated = sparse.csr_matrix((orig.data, orig.indices, np.tile(orig.indptr, repeat_num)))

This technique repeats the vector repeat_num times, while only modifying the the indptr. The downside is that due to the way the csr matrices encode data, instead of creating a matrix that's repeat_num x n in dimension, it creates one that's (2 * repeat_num - 1) x n where every odd row is 0. This shouldn't be too big of a deal as any operation will be quick given that each row is 0, and they should be pretty easy to slice out afterwards (with something like [::2]), but it's not ideal.

I think the marked answer is probably still the "best" way to do this.

Erik
  • 6,470
  • 5
  • 36
  • 37
0

One of the most efficient ways to repeat the sparse matrix would be the way OP suggested. I modified indptr so that it doesn't output rows of 0s.

## original sparse matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
x = scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
x.toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])

To repeat this, you need to repeat data and indices, and you need to fix-up the indptr. This is not the most elegant way, but it works.

## repeated sparse matrix
repeat = 5 
new_indptr = indptr
for r in range(1,repeat):
    new_indptr = np.concatenate((new_indptr, new_indptr[-1]+indptr[1:]))
x = scipy.sparse.csr_matrix((np.tile(data,repeat), np.tile(indices,repeat), new_indptr))
x.toarray()
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6],
       [1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])