I have a bag-of-words representation of a corpus stored in a D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] represents the number of occurrences of word w in document d.
I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:
- If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
- Otherwise, not_word_occs[d,w] should be zero.
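To make the target concrete, here is a minimal sketch of the output I'm after on a toy matrix. The 3x4 data is made up for illustration, and I'm assuming a scipy.sparse CSR matrix (which is what I'd expect CountVectorizer to return). It goes through a dense array, so it only shows the definition and is not something that would scale.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 documents (rows), 4 words (columns).
word_freqs = sparse.csr_matrix(np.array([
    [2, 0, 1, 0],
    [0, 0, 3, 0],
    [1, 1, 0, 0],
]))

# Dense illustration only: toarray() materialises all D*W entries,
# which is fine here but not for a large corpus.
not_word_occs = (word_freqs.toarray() == 0).astype(int)

print(not_word_occs)
```

Every zero count becomes a one and every nonzero count becomes a zero, so the first row [2, 0, 1, 0] maps to [0, 1, 0, 1].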
Eventually, this matrix will need to be multiplied with other matrices which might be dense or sparse.
I've tried a number of methods, including:
not_word_occs = (word_freqs == 0).astype(int)
This works for toy examples, but results in a MemoryError for my actual data (which is approx. 18,000x16,000).
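A rough back-of-the-envelope estimate suggests why this runs out of memory: because most counts are zero, the complement is mostly ones, so it is effectively dense. The sketch below assumes the approximate shape given above and an 8-byte integer (what int usually maps to on 64-bit Linux or macOS; it may differ on other platforms), and it ignores any extra index storage a sparse format would add on top.

```python
import numpy as np

# Approximate shape of the real data (from the description above).
D, W = 18_000, 16_000
n_entries = D * W  # total number of cells in the matrix

# Size if every cell is stored as a 64-bit integer (assumed dtype).
bytes_int64 = n_entries * np.dtype(np.int64).itemsize

# Size if every cell is stored as a 1-byte boolean instead.
bytes_bool = n_entries * np.dtype(np.bool_).itemsize

print(f"{n_entries:,} entries")
print(f"int64: {bytes_int64 / 1e9:.3f} GB, bool: {bytes_bool / 1e9:.3f} GB")
```

So a fully materialised integer result needs on the order of a couple of gigabytes before counting any temporaries, while a boolean version would be roughly an eighth of that.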
I've also tried np.logical_not():
word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)
This seemed promising, but np.logical_not() does not work on sparse matrices, giving the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Any ideas or guidance would be appreciated.
(By the way, word_freqs is generated by sklearn's feature_extraction.text.CountVectorizer(). If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)