I have a script that reads in a corpus of many short documents and vectorizes them using sklearn. The result is a large, sparse matrix (specifically, a scipy.sparse.csr.csr_matrix) with dimensions 371k x 100k. My goal is to normalize it so that each row sums to 1, i.e. to divide each entry by the sum of the entries in its row.
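For context, the pipeline looks roughly like this (a minimal sketch; the specific vectorizer and the variable names here are just illustrative, not necessarily what my script uses):

    from sklearn.feature_extraction.text import CountVectorizer

    # docs: a list of ~371k short documents
    vectorizer = CountVectorizer()
    M = vectorizer.fit_transform(docs)  # scipy.sparse.csr_matrix, roughly 371k x 100k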
I've tried several ways of doing this, and each has given me a MemoryError:
- M /= M.sum(axis=1)
- M_normalized = sklearn.preprocessing.normalize(M, axis=1, norm='l1')
- a for loop which sums and divides the rows one at a time and adds the result to an all-zero matrix (sketched below)
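The third attempt looks roughly like this (a minimal sketch of the idea rather than my exact code; variable names are illustrative):

    from scipy import sparse

    # Start from an all-zero matrix of the same shape and fill it row by row.
    M_normalized = sparse.lil_matrix(M.shape)
    for i in range(M.shape[0]):
        row = M.getrow(i)          # 1 x 100k sparse row
        s = row.sum()
        if s != 0:
            # write only the nonzero entries of the normalized row
            M_normalized[i, row.indices] = row.data / s
    M_normalized = M_normalized.tocsr()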
For the first option, I used pdb to stop the script right before the normalization step so I could monitor memory consumption with htop. Interestingly, as soon as I stepped forward in the script to execute M /= M.sum(axis=1), the memory error was thrown immediately, in under a second.
My machine has 16GB of memory + 16GB swap, but only around 8GB + 16GB free at the point in the program where the normalization takes place.
Can anyone explain why all of these methods run into memory problems? Surely the third one, at least, should only use a small amount of memory, since it only looks at one row at a time. Is there a more memory-efficient way to achieve this?