0

I have a scipy.sparse.csc_matrix with dtype = np.int32. I want to efficiently divide each column (or row, whichever faster for csc_matrix) of the matrix by the diagonal element in that column. So mnew[:,i] = m[:,i]/m[i,i] . Note that I need to convert my matrix to np.double (since mnew elements will be in [0,1]) and since the matrix is massive and very sparse I wonder if I can do it in some efficient/no for loop/never going dense way.

Best,

Ilya

obachtos
  • 977
  • 1
  • 12
  • 30
Ilya
  • 561
  • 2
  • 17
  • What if that `m[i,i]` value is 0? Getting the diagonal should be easy, and multiplying will also be efficient. How about giving us a small example, e.g a 10x10 matrix, and demonstrate with its dense equivalent. – hpaulj Jul 27 '17 at 00:31
  • m[i,i] is guaranteed to be non zero and larger than any value in it's row/column. This is a co-occurance matrix of items coming in short (5-20 item) lists (calculated over several 100k such lists). The matrix size is (numUniqueItems,numUniqueItems). Thus the diagonal elements signify in how many lists a particular item appeared, off diagonal elements signify in how many lists both ith and jth items appeared. Dividing along a column (or row) by the diagonal element will be p(j'th item appeared | i'th item appeared) – Ilya Jul 27 '17 at 01:30

1 Answers1

2

Make a sparse matrix:

In [379]: M = sparse.random(5,5,.2, format='csr')
In [380]: M
Out[380]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [381]: M.diagonal()
Out[381]: array([ 0.,  0.,  0.,  0.,  0.])

too many 0s in the diagonal - lets add a nonzero diagonal:

In [382]: D=sparse.dia_matrix((np.random.rand(5),0),shape=(5,5))
In [383]: D
Out[383]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements (1 diagonals) in DIAgonal format>
In [384]: M1 = M+D


In [385]: M1
Out[385]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [387]: M1.A
Out[387]: 
array([[ 0.35786668,  0.81754484,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.41928992,  0.        ,  0.01371273,  0.        ],
       [ 0.        ,  0.        ,  0.4685924 ,  0.        ,  0.35724102],
       [ 0.        ,  0.        ,  0.77591294,  0.95008721,  0.16917791],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.16659141]])

Now it's trivial to divide each column by its diagonal (this is a matrix 'product')

In [388]: M1/M1.diagonal()
Out[388]: 
matrix([[ 1.        ,  1.94983185,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  1.        ,  0.        ,  0.01443313,  0.        ],
        [ 0.        ,  0.        ,  1.        ,  0.        ,  2.1444144 ],
        [ 0.        ,  0.        ,  1.65583764,  1.        ,  1.01552603],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  1.        ]])

Or divide the rows - (multiply by a column vector)

In [391]: M1/M1.diagonal()[:,None]

oops, these are dense; let's make the diagonal sparse

In [408]: md = sparse.csr_matrix(1/M1.diagonal())  # do the inverse here
In [409]: md
Out[409]: 
<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [410]: M.multiply(md)
Out[410]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [411]: M.multiply(md).A
Out[411]: 
array([[ 0.        ,  1.94983185,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.01443313,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  2.1444144 ],
       [ 0.        ,  0.        ,  1.65583764,  0.        ,  1.01552603],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

md.multiply(M) for the column version.

Division of sparse matrix - similar except it is using the sum of the rows instead of the diagonal. Deals a bit more with the potential 'divide-by-zero' issue.

hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Unfortunately this throws a memory error. In particular true divide calls np.true_divide(self.todense(), other) making the matrix dense (and thus impossible to fit in my memory) – Ilya Jul 27 '17 at 01:39
  • Sorry, I didn't notice that my divides returned `array` and `matrix`. Multiplication/division with a dense does produce a dense. Got to make the diagonal's sparse. – hpaulj Jul 27 '17 at 01:46
  • Your edit works perfectly and does what I need, thanks! – Ilya Jul 27 '17 at 02:03
  • I found an earlier question that divides by the row sums. – hpaulj Jul 27 '17 at 03:54