I have a script that reads in a corpus of many short documents and vectorizes them using sklearn. The result is a large, sparse matrix (specifically a `scipy.sparse.csr.csr_matrix`) with dimensions 371k × 100k. My goal is to normalize it so that each row sums to 1, i.e., to divide each entry by the sum of the entries in its row. I've tried several ways of doing this, and each has given me a MemoryError:

  • `M /= M.sum(axis=1)`
  • `M_normalized = sklearn.preprocessing.normalize(M, axis=1, norm='l1')`
  • a `for` loop that sums and divides the rows one at a time and adds the result to an all-zero matrix (roughly as sketched below)
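
To be concrete, the third approach was roughly equivalent to the following sketch (illustrative, not my exact code; it assumes `M` is the CSR matrix described above):

```python
import scipy.sparse as sparse

# Build the result one row at a time in a second, all-zero matrix.
M_normalized = sparse.lil_matrix(M.shape)
for i in range(M.shape[0]):
    row = M.getrow(i)          # this copies the row
    s = row.sum()
    if s != 0:
        # Temporarily densify the single row (about 800 KB at 100k float64
        # columns) and store its normalized values in the new matrix.
        M_normalized[i] = (row / s).toarray().ravel()
M_normalized = M_normalized.tocsr()
```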

For the first option, I used pdb to stop the script right before the normalization step so I could monitor memory consumption with htop. Interestingly, as soon as I stepped forward in the script to execute `M /= M.sum(axis=1)`, the MemoryError was thrown immediately, in under a second.

My machine has 16GB of RAM plus 16GB of swap, but only around 8GB of RAM (plus the full 16GB of swap) is free at the point in the program where the normalization takes place.

Can anyone explain why all of these methods run into memory problems? Surely the third one, at least, should use only a small amount of memory, since it looks at just one row at a time. Is there a more memory-efficient way to achieve this?

  • The 3rd way (the loop) should work if you don't save the result to a new matrix, but instead divide each row of the same matrix in place (`M[i] /= M[i].sum()`). That saves you from creating a second matrix of the same size, and therefore saves memory. – dnalow Sep 15 '16 at 16:29
  • How many nonzero values are in the matrix? – Warren Weckesser Sep 15 '16 at 16:30
  • This is a large sparse matrix, so the usual `numpy` array ideas don't apply. For one thing, `M.sum(axis=1)` is performed with a matrix multiplication. Row indexing `M[i,:]` is also much slower, and indexing multiple rows is actually performed with a matrix multiplication as well (as discussed in another recent `sparse` question). – hpaulj Sep 15 '16 at 16:32
  • Shot in the dark (I haven't tried it): Use `sklearn.preprocessing.normalize(M, axis=1, norm='l1', copy=False)` to normalize `M` in-place. – Warren Weckesser Sep 15 '16 at 16:39
  • `sklearn` uses compiled `cython` code in `https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs_fast.pyx` to perform `inplace_csr_row_normalize_l1`. It looks a lot like what I'd write if doing the normalization in place using the matrix's `indptr`, `indices`, and `data` attributes (a minimal sketch of that approach follows these comments). – hpaulj Sep 15 '16 at 16:48
  • @hpaulj Great. Does it work with nomarch's matrix? I haven't tried it yet. – Warren Weckesser Sep 15 '16 at 16:49
  • @nomarch: See the accepted answer here: http://stackoverflow.com/questions/8358962/efficiently-row-standardize-a-matrix – Warren Weckesser Sep 15 '16 at 16:52
  • I tried `sklearn.preprocessing.normalize(M, axis=1, norm='l1', copy=False)` with a sparse CSR matrix of size (371000, 100000) and 18550000 nonzero elements, and it worked fine. – Warren Weckesser Sep 15 '16 at 16:55
  • If making a copy during this normalization step causes a memory error, what will the rest of the `sklearn` processing do? I just answered a question about splitting a sparse matrix into training and testing matrices. – hpaulj Sep 15 '16 at 17:48
  • @Warren Weckesser My matrix has just under 21M nonzero entries, about 0.056% of the total. Normalizing in place with `sklearn.preprocessing.normalize(M, axis=1, norm='l1', copy=False)` appears to have done the trick. – nomarch Sep 15 '16 at 17:56
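
For reference, here is a minimal sketch of what the in-place L1 normalization amounts to, operating directly on the CSR `data` and `indptr` attributes as hpaulj describes. The function name is illustrative, and it assumes `M.data` has a float dtype; sklearn's compiled `inplace_csr_row_normalize_l1` does essentially this in Cython:

```python
import numpy as np
import scipy.sparse as sparse

def inplace_row_normalize_l1(M):
    """L1-normalize each row of a CSR matrix in place.

    Only M.data is modified, so no second 371k x 100k matrix is allocated.
    """
    assert sparse.isspmatrix_csr(M)
    for i in range(M.shape[0]):
        # M.indptr[i]:M.indptr[i+1] is the slice of row i's nonzeros.
        start, end = M.indptr[i], M.indptr[i + 1]
        s = np.abs(M.data[start:end]).sum()
        if s != 0:
            M.data[start:end] /= s
```

In practice, `sklearn.preprocessing.normalize(M, axis=1, norm='l1', copy=False)` is the simpler way to get the same effect, as the comments above confirm; the sketch just shows why it needs essentially no extra memory beyond the matrix itself.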
