
I have a sparse matrix with only a few nonzero elements, and I would like to row-normalize it. However, when I do this, the result is converted to a dense NumPy array, which is not acceptable from a performance standpoint.

To make things more concrete, consider the following example:

from scipy.sparse import csr_matrix

x = csr_matrix([[0, 1, 1], [2, 3, 0]])  # sparse
normalization = x.sum(axis=1)  # dense, this is OK

x / normalization  # this is dense, not OK, can be huge

Is there an elegant way to do this without having to resort to for loops?

EDIT

Yes, this can be done with sklearn.preprocessing.normalize and norm='l1', but I do not want to depend on sklearn.

Pavlin
  • See https://stackoverflow.com/questions/12305021/efficient-way-to-normalize-a-scipy-sparse-matrix – Warren Weckesser Mar 14 '18 at 20:18
  • ... and https://stackoverflow.com/questions/30260642/normalizing-matrix-row-scipy-matrix – Warren Weckesser Mar 14 '18 at 20:19
  • I have no wish to depend on either scikit learn or networkx. This is not very helpful. – Pavlin Mar 14 '18 at 20:24
  • *"I have no wish to depend on either scikit learn or networkx."* Well, if you put that in the question, then we would *know* that those suggestions are not helpful. We can't read your mind! – Warren Weckesser Mar 14 '18 at 20:27
  • You're right, I should've made that explicit. – Pavlin Mar 14 '18 at 20:28
  • The second answer in the suggested duplicate by 0x01 shows how to do it without scikit-learn. Also, the second answer in the second suggested duplicate by Warren Weckesser. – rayryeng Mar 14 '18 at 20:29
  • @rayryeng Yes, however it is done using nested for loops, something I did make explicit that I don't want to do. My question is simply whether there is an elegant solution for this using numpy or scipy to get the nice speed benefits of those. – Pavlin Mar 14 '18 at 20:32

1 Answer


You can always use the CSR internals (x.data holds the stored values row by row, and x.indptr marks where each row starts and ends):

>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> x = sparse.csr_matrix([[0, 1, 1], [2, 3, 0]]) 
>>> 
>>> x.data = x.data / np.repeat(np.add.reduceat(x.data, x.indptr[:-1]), np.diff(x.indptr))
>>> x
<2x3 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>
>>> x.toarray()
array([[0. , 0.5, 0.5],
       [0.4, 0.6, 0. ]])
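
The same idea can be wrapped in a small helper. This is only a sketch, not part of the original answer: the name `row_normalize_csr` is made up for illustration, and it swaps `np.add.reduceat` for `x.sum(axis=1)` on the assumption that the matrix may contain empty rows (as far as I can tell, `reduceat` can raise an index error when the last row has no stored entries). Rows whose sum is zero are left unchanged here, which is a design choice rather than a requirement.

```python
import numpy as np
from scipy import sparse


def row_normalize_csr(x):
    """Return a float CSR copy of x in which every non-zero-sum row sums to 1.

    Hypothetical helper for illustration; rows that sum to zero are left as-is.
    """
    # Work on a float copy so integer input is not truncated by the division.
    x = sparse.csr_matrix(x, dtype=np.float64, copy=True)

    # x.sum(axis=1) returns an (n_rows, 1) dense matrix; flatten it to 1-D.
    row_sums = np.asarray(x.sum(axis=1)).ravel()

    # Avoid division by zero: dividing by 1 leaves such rows unchanged.
    row_sums[row_sums == 0] = 1.0

    # np.diff(x.indptr) is the number of stored entries in each row, so
    # np.repeat yields exactly one divisor per stored entry in x.data.
    x.data /= np.repeat(row_sums, np.diff(x.indptr))
    return x


x = sparse.csr_matrix([[0, 1, 1], [2, 3, 0], [0, 0, 0]])
normalized = row_normalize_csr(x)
print(sparse.issparse(normalized))  # the result stays sparse
print(normalized.toarray())
```

Only `x.data` is touched, so the work is proportional to the number of stored entries and no dense matrix of the full shape is ever created.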
Paul Panzer