
I have a bag-of-words representation of a corpus stored in a D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] represents the number of occurrences of word w in document d.

I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:

  • If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
  • Otherwise, not_word_occs[d,w] should be zero.

Eventually, this matrix will need to be multiplied with other matrices which might be dense or sparse.
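For concreteness, here is the transform I want on a toy example (assuming word_freqs is a scipy.sparse matrix, as CountVectorizer produces); note that the result is inherently dense, since most counts are zero:

```python
import numpy as np
from scipy import sparse

# Toy stand-in for CountVectorizer output: a 4-document, 5-word corpus.
word_freqs = sparse.csr_matrix(np.array([
    [2, 0, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 4],
]))

# Desired complement: 1 where the count is zero, 0 elsewhere.
not_word_occs = (word_freqs.toarray() == 0).astype(int)
```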


I've tried a number of methods, including:

not_word_occs = (word_freqs == 0).astype(int)

This works for toy examples, but results in a MemoryError on my actual data (which is approx. 18,000 x 16,000).

I've also tried np.logical_not():

word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)

This seemed promising, but np.logical_not() does not work on sparse matrices, giving the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Any ideas or guidance would be appreciated.

(By the way, word_freqs is generated by sklearn's preprocessing.CountVectorizer(). If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)

MaxU - stand with Ukraine
ethan.roday

3 Answers


The complement of the nonzero positions of a sparse matrix is dense, so if you want to achieve your stated goals with standard numpy arrays you will need quite a bit of RAM. Here's a quick and totally unscientific hack to give you an idea of how many arrays of that sort your computer can handle:


>>> import numpy as np
>>> a = []
>>> for j in range(100):
...     print(j)
...     a.append(np.ones((16000, 18000), dtype=int))

My laptop chokes at j=1. So even though you can compute the complement:

>>> compl = np.ones(S.shape, int)
>>> compl[S.nonzero()] = 0

memory will be an issue unless you have a really good computer.

One way out may be to not compute the complement explicitly. Call it C = B1 - A, where B1 is the same-shape matrix completely filled with ones and A is the adjacency matrix of your original sparse matrix. For example, the matrix product XC can be written as XB1 - XA, so you have one multiplication with the sparse A and one with B1, which is actually cheap because it boils down to computing row sums. The point here is that you can compute all of this without computing C first.
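A minimal sketch of that decomposition, on made-up toy data (the names freq, A, and X are illustrative, not from the question):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A: the adjacency matrix -- 1 wherever the sparse matrix has a nonzero.
freq = sparse.random(6, 8, density=0.2, format="csr", random_state=0)
A = (freq != 0).astype(np.int64)

# Some matrix we want to multiply with the complement C = B1 - A.
X = rng.integers(0, 3, size=(4, 6))

# Direct (memory-hungry) version: materialize the dense complement.
C = np.ones(A.shape, dtype=np.int64) - A.toarray()
direct = X @ C

# Decomposed version: X @ C = X @ B1 - X @ A. X @ B1 is just the row
# sums of X broadcast across the columns, so C is never formed.
decomposed = X.sum(axis=1, keepdims=True) - np.asarray(X @ A)
```

At toy size the two agree exactly; at 18,000 x 16,000 only the decomposed form is feasible.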

A particularly simple example is multiplication with a one-hot vector. Such a multiplication just selects a column (if multiplying from the right) or a row (if multiplying from the left) of the other matrix. That means you only need to find that column or row of the sparse matrix and take its complement (no problem for a single slice), and if you do this for a one-hot matrix, then as above you needn't compute the complement explicitly.
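The one-hot case can be sketched like this (toy data again; only a single sparse column ever gets complemented):

```python
import numpy as np
from scipy import sparse

freq = sparse.random(5, 7, density=0.3, format="csc", random_state=1)
A = (freq != 0).astype(np.int64)

# One-hot vector: C @ e selects column 3 of the complement C = B1 - A,
# so only that single sparse column needs to be complemented.
e = np.zeros(7, dtype=np.int64)
e[3] = 1
col = 1 - A[:, 3].toarray().ravel()

# Sanity check against the explicit complement (fine at this toy size).
C = np.ones(A.shape, dtype=np.int64) - A.toarray()
assert np.array_equal(col, C @ e)
```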

Paul Panzer
  • I think I follow you, but what is "the adjacency matrix of [my] original sparse matrix"? – ethan.roday Feb 05 '17 at 23:55
  • @err1100 Oh, jargon, sorry. It's the same matrix with all nonzero entries set to one. Btw. I'm pretty sure the scheme I describe in the last paragraph will be applicable to one-hot matrices. Let me think a bit and I'll add a few lines. – Paul Panzer Feb 06 '17 at 00:02
  • Thanks, Paul. I was able to get things working thanks to your decomposition. Computing the actual matrices still used too much RAM (specifically, making a 16K by 18K matrix of ones), but your one-hot paragraph allowed me to figure out how to do that without making the matrix. – ethan.roday Feb 06 '17 at 01:49

Make a small sparse matrix:

In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in COOrdinate format>

The repr(freq) display shows the shape, number of stored elements, and format.

In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
  ", try using != instead.", SparseEfficiencyWarning)
Out[745]: 
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
    with 90 stored elements in Compressed Sparse Row format>

If I apply your first approach, I get a warning and a new matrix with 90 (out of 100) stored elements. That logical not is, for all practical purposes, no longer sparse.

In general, numpy functions do not work when applied to sparse matrices; they only work if they delegate the task to sparse methods. But even if np.logical_not worked, it wouldn't solve the memory issue.

hpaulj
  • In that case, do you have any suggestions? For context, I have a one-hot _D_ by _K_ matrix `doc_topics` representing the topic of each document. Later on, I need to end up with a _W_ by _K_ matrix that tells me how many documents of topic _k_ do _not_ contain word _w_. With my toy examples, I was accomplishing this with `not_word_occs.T @ doc_topics`, but I now need another way to do that. – ethan.roday Feb 05 '17 at 23:44
  • @hpaulj Once I try to use the logical not this way, if I try to do a Hadamard product with another sparse matrix, I get a dimension mismatch error, which is strange. – hegdep May 14 '20 at 15:41
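Putting the decomposition idea together with the goal described in the first comment, here is a hedged sketch of the W by K count without ever materializing the dense complement (the names word_freqs and doc_topics are taken from the comment; the data here is made up):

```python
import numpy as np
from scipy import sparse

D, W, K = 8, 10, 3
word_freqs = sparse.random(D, W, density=0.2, format="csr", random_state=2)
A = (word_freqs != 0).astype(np.int64)        # 1 where doc d contains word w
doc_topics = np.eye(K, dtype=np.int64)[np.arange(D) % K]  # one-hot D x K

# not_word_occs.T @ doc_topics = (B1 - A).T @ doc_topics
#   = (docs per topic, broadcast over words) - A.T @ doc_topics,
# so the dense complement never has to be formed.
docs_per_topic = doc_topics.sum(axis=0)       # shape (K,)
result = docs_per_topic - np.asarray(A.T @ doc_topics)

# Check against the direct computation (only feasible at toy size).
not_word_occs = 1 - A.toarray()
assert np.array_equal(result, not_word_occs.T @ doc_topics)
```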

Here is an example using pandas.SparseDataFrame:

In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)

In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)

In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)

In [46]: d1.memory_usage()
Out[46]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

In [47]: d2.memory_usage()
Out[47]:
Index    80
0         0
1         0
2         0
3         0
4         0
5         0
6         0
7         0
8         0
9         0
dtype: int64

math:

In [48]: d2 - d1
Out[48]:
   0  1  2  3  4  5  6  7  8  9
0  1  1  0  0  1  1  0  1  1  1
1  1  1  1  1  1  1  1  1  0  1
2  1  1  1  1  1  1  1  1  1  1
3  1  1  1  1  1  1  1  0  1  1
4  1  1  1  1  1  1  1  1  1  1
5  0  1  1  1  1  1  1  1  1  1
6  1  1  1  1  1  1  1  1  1  1
7  0  1  1  0  1  1  1  0  1  1
8  1  1  1  1  1  1  0  1  1  1
9  1  1  1  1  1  1  1  1  1  1

source sparse matrix:

In [49]: d1
Out[49]:
   0  1  2  3  4  5  6  7  8  9
0  0  0  1  1  0  0  1  0  0  0
1  0  0  0  0  0  0  0  0  1  0
2  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  1  0  0
4  0  0  0  0  0  0  0  0  0  0
5  1  0  0  0  0  0  0  0  0  0
6  0  0  0  0  0  0  0  0  0  0
7  1  0  0  1  0  0  0  1  0  0
8  0  0  0  0  0  0  1  0  0  0
9  0  0  0  0  0  0  0  0  0  0

memory usage:

In [50]: (d2 - d1).memory_usage()
Out[50]:
Index    80
0        16
1         0
2         8
3        16
4         0
5         0
6        16
7        16
8         8
9         0
dtype: int64

PS: if you can't build the whole SparseDataFrame at once (because of memory constraints), you can use an approach similar to the one used in this answer

MaxU - stand with Ukraine