I have a sparse matrix of size (n x m):

import numpy as np
from collections import Counter
from scipy.sparse import dok_matrix

sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
for doc_index, document in enumerate(data):
    document_counter = Counter(document)
    for word in set(document):
        sparse_dtm[doc_index, word_to_index[word]] = document_counter[word]

Where:

  • num_documents = n
  • vocabulary_size = m
  • data = list of tokenized lists

Also, I have a list with length n:

sums = sparse_dtm.sum(1).tolist()

Now, I want to do an element-wise division in which each cell of row_i in sparse_dtm is divided by sums[i].

A naive approach, using the traditional Python element-wise division:

sparse_dtm / sums

leads to the following error:

TypeError: unsupported operand type(s) for /: 'csr_matrix' and 'list'

How can I perform this element-wise division?

Emil

2 Answers

1

If I understand correctly, you need to divide each row by its row sum, is that correct?

In that case, you need to reshape the sum into a column:

sparse_dtm / sparse_dtm.sum(1).reshape(-1, 1)

You can also do it with a pandas DataFrame, for example:

import numpy as np
import pandas as pd

row_num = 10
col_num = 5
# Note: despite its name, this is a dense ndarray, not a scipy.sparse matrix
sparse_dtm = np.empty((row_num, col_num), dtype=np.float32)
for row in range(row_num):
    for col in range(col_num):
        value = (row+1) * (col+2)
        sparse_dtm[row, col] = value
df = pd.DataFrame(sparse_dtm)
print(df)

gives

      0     1     2     3     4
0   2.0   3.0   4.0   5.0   6.0
1   4.0   6.0   8.0  10.0  12.0
2   6.0   9.0  12.0  15.0  18.0
3   8.0  12.0  16.0  20.0  24.0
4  10.0  15.0  20.0  25.0  30.0
5  12.0  18.0  24.0  30.0  36.0
6  14.0  21.0  28.0  35.0  42.0
7  16.0  24.0  32.0  40.0  48.0
8  18.0  27.0  36.0  45.0  54.0
9  20.0  30.0  40.0  50.0  60.0

and then divide each row by its row sum

df / df.sum(axis=1).values.reshape(-1, 1)

that gives

     0     1    2     3    4
0  0.1  0.15  0.2  0.25  0.3
1  0.1  0.15  0.2  0.25  0.3
2  0.1  0.15  0.2  0.25  0.3
3  0.1  0.15  0.2  0.25  0.3
4  0.1  0.15  0.2  0.25  0.3
5  0.1  0.15  0.2  0.25  0.3
6  0.1  0.15  0.2  0.25  0.3
7  0.1  0.15  0.2  0.25  0.3
8  0.1  0.15  0.2  0.25  0.3
9  0.1  0.15  0.2  0.25  0.3
Max Pierini
  • Thanks Max, your first suggestion works. However, I am working with a sparse matrix as I would run into memory issues with a regular matrix. Therefore, the Pandas' Dataframe suggestion does not work in my case. – Emil Mar 30 '21 at 12:37
  • `sparse_dtm.sum(1)` should already be a (n,1) shaped `np.matrix`. No need to reshape. – hpaulj Mar 30 '21 at 16:36
  • @hpaulj with the `np.ndarray` populated by the for loop I defined, `sparse_dtm.shape` is `(10, 5)`. `sparse_dtm.sum(1).shape` is then `(10,)`, but you need a shape of `(10, 1)` as you said, so you have to reshape with `.reshape(-1, 1)` – Max Pierini Mar 30 '21 at 17:14
  • For a dense array, `ndarray` you do need the reshape, or use `keepdims`. But you continued to call it `sparse_dtm`, so I assumed it was `scipy.sparse` as the OP created. – hpaulj Mar 30 '21 at 18:20
0
In [189]: M = sparse.dok_matrix([[0,1,3,0],[0,0,2,0],[1,0,0,0]])
In [190]: M
Out[190]: 
<3x4 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Dictionary Of Keys format>
In [191]: M.A
Out[191]: 
array([[0, 1, 3, 0],
       [0, 0, 2, 0],
       [1, 0, 0, 0]])

sum(1) produces a (3,1) np.matrix, which can be used directly in the division:

In [192]: M.sum(1)
Out[192]: 
matrix([[4],
        [2],
        [1]])
In [193]: M/M.sum(1)
Out[193]: 
matrix([[0.  , 0.25, 0.75, 0.  ],
        [0.  , 0.  , 1.  , 0.  ],
        [1.  , 0.  , 0.  , 0.  ]])

Note that the result is a dense np.matrix, not sparse.

This could cause problems if a row sum were 0, but with your construction that might not be possible.
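If zero row sums can occur (say, an empty document), one way around the division by zero is to invert only the nonzero sums. This is a minimal sketch, assuming all-zero rows should simply stay zero; `safe_row_inverse` is a hypothetical helper name, not a SciPy function:

```python
import numpy as np
from scipy import sparse


def safe_row_inverse(M):
    """Return 1/row_sum for each row as a 1-D array, using 0 where the sum is 0."""
    # M.sum(axis=1) is an (n, 1) np.matrix; flatten it to a plain 1-D float array
    row_sums = np.asarray(M.sum(axis=1)).ravel().astype(np.float64)
    inv = np.zeros_like(row_sums)
    nonzero = row_sums != 0
    inv[nonzero] = 1.0 / row_sums[nonzero]
    return inv


# Example matrix whose middle row is entirely zero
M = sparse.csr_matrix(np.array([[0, 1, 3, 0],
                                [0, 0, 0, 0],
                                [1, 0, 0, 0]], dtype=np.float64))
inv = safe_row_inverse(M)

# Scale each row by its inverse sum; the result stays sparse
D = sparse.csr_matrix(inv.reshape(-1, 1))
normalized = D.multiply(M)
```

The all-zero row is left untouched instead of producing `inf` or `nan`.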

We can retain the sparse result by first converting the sums to sparse. I multiply by the inverse because there isn't a sparse element-wise division (it would have to divide by all those implicit 0s):

In [205]: D=sparse.csr_matrix(1/M.sum(1))
In [206]: D
Out[206]: 
<3x1 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [207]: D.A
Out[207]: 
array([[0.25],
       [0.5 ],
       [1.  ]])
In [208]: D.multiply(M)
Out[208]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [209]: _.A
Out[209]: 
array([[0.  , 0.25, 0.75, 0.  ],
       [0.  , 0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  , 0.  ]])
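Another way to express the same row scaling is to left-multiply by a sparse diagonal matrix of inverse row sums. The following is a sketch under the assumption that no row sum is zero; the variable names are illustrative:

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 1, 3, 0],
                                [0, 0, 2, 0],
                                [1, 0, 0, 0]], dtype=np.float64))

# Flatten the (n, 1) np.matrix of row sums into a 1-D array
row_sums = np.asarray(M.sum(axis=1)).ravel()

# n x n sparse diagonal matrix holding 1/row_sum on the diagonal
D = sparse.diags(1.0 / row_sums)

# Left-multiplying by a diagonal matrix scales each row; the product is sparse
normalized = D @ M
result = normalized.toarray()  # densified here only to inspect the small example
```

Only the nonzero entries of `M` are touched, so memory use stays proportional to the number of stored elements.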

sklearn also has some additional sparse matrix utilities, including row normalization:

In [210]: from sklearn import preprocessing
In [211]: preprocessing.normalize(M, norm='l1', axis=1)
Out[211]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [212]: _.A
Out[212]: 
array([[0.  , 0.25, 0.75, 0.  ],
       [0.  , 0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  , 0.  ]])
hpaulj