Normalize sparse row probability matrix

Question

I have a sparse matrix with only a few nonzero elements, and I would like to row-normalize it. However, when I do this, the result is converted to a dense numpy array, which is not acceptable from a performance standpoint.

To make things more concrete, consider the following example:

from scipy.sparse import csr_matrix

x = csr_matrix([[0, 1, 1], [2, 3, 0]])  # sparse
normalization = x.sum(axis=1)  # dense, this is OK

x / normalization  # this is dense, not OK, can be huge

Is there an elegant way to do this without having to resort to for loops?

EDIT

Yes, this can be done with sklearn.preprocessing.normalize using 'l1' normalization; however, I have no wish to depend on sklearn.

See https://stackoverflow.com/questions/12305021/efficient-way-to-normalize-a-scipy-sparse-matrix — Warren Weckesser, Mar 14 '18 at 20:18
... and https://stackoverflow.com/questions/30260642/normalizing-matrix-row-scipy-matrix — Warren Weckesser, Mar 14 '18 at 20:19
I have no wish to depend on either scikit learn or networkx. This is not very helpful. — Pavlin, Mar 14 '18 at 20:24
*"I have no wish to depend on either scikit learn or networkx."* Well, if you put that in the question, then we would *know* that those suggestions are not helpful. We can't read your mind! — Warren Weckesser, Mar 14 '18 at 20:27
The second answer in the suggested duplicate by 0x01 shows how to do it without scikit-learn. Also, the second answer in the second suggested duplicate by Warren Weckesser. — rayryeng, Mar 14 '18 at 20:29
@rayryeng Yes, however it is done using nested for loops, something I did make explicit that I don't want to do. My question is simply whether there is an elegant solution for this using numpy or scipy to get the nice speed benefits of those. — Pavlin, Mar 14 '18 at 20:32

score 6 · Accepted Answer · answered Mar 14 '18 at 20:33

You can always use csr internals:

>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> x = sparse.csr_matrix([[0, 1, 1], [2, 3, 0]]) 
>>> 
>>> x.data = x.data / np.repeat(np.add.reduceat(x.data, x.indptr[:-1]), np.diff(x.indptr))
>>> x
<2x3 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>
>>> x.toarray()
array([[0. , 0.5, 0.5],
       [0.4, 0.6, 0. ]])
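
To spell out what the one-liner does: `np.add.reduceat(x.data, x.indptr[:-1])` sums the stored values of each row, `np.diff(x.indptr)` counts the stored entries per row, and `np.repeat` expands each row sum into one divisor per stored entry. One caveat, as far as I can tell: `np.add.reduceat` does not return 0 for an empty segment, so the one-liner implicitly assumes every row has at least one stored entry. Below is a minimal sketch of a more defensive variant. The function name `row_normalize_csr` is hypothetical, and the choices to return a copy and to leave zero-sum rows unscaled are assumptions, not part of the original answer.

```python
import numpy as np
from scipy import sparse


def row_normalize_csr(x):
    """Return a float CSR copy of x whose rows with nonzero sum each sum to 1."""
    # astype copies by default, so the caller's matrix is left untouched.
    x = sparse.csr_matrix(x).astype(np.float64)

    # Per-row sums as a flat 1-D array (x.sum returns a 2-D result).
    row_sums = np.asarray(x.sum(axis=1)).ravel()

    # Assumption: rows whose sum is zero are left unscaled
    # instead of being divided by zero.
    row_sums[row_sums == 0] = 1.0

    # Number of stored entries in each row; empty rows contribute nothing.
    counts = np.diff(x.indptr)

    # Expand each row sum to one divisor per stored entry, then scale in place.
    x.data /= np.repeat(row_sums, counts)
    return x


# Same matrix as above, plus an all-zero row to exercise the edge case.
x = sparse.csr_matrix([[0, 1, 1], [2, 3, 0], [0, 0, 0]])
normalized = row_normalize_csr(x)
print(sparse.issparse(normalized))
print(normalized.toarray())
```

The result stays in CSR format throughout; only the `data` array is rescaled, while `indices` and `indptr` are reused unchanged.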

Paul Panzer

Thank you! This is exactly what I was looking for! – Pavlin Mar 14 '18 at 20:34
This is good to know. Thanks. +1. – rayryeng Mar 14 '18 at 20:36
On a 1000x800 random matrix, your answer is faster than even the `sklearn.preprocessing.normalize`. – hpaulj Mar 14 '18 at 21:39
