
I am building n-grams from multiple text documents using scikit-learn, and I need to build document frequencies using CountVectorizer.

Example:

document1 = "john is a nice guy"

document2 = "person can be a guy"

So the document frequencies will be:

{'be': 1,
 'can': 1,
 'guy': 2,
 'is': 1,
 'john': 1,
 'nice': 1,
 'person': 1}

Here the documents are just short strings, but when I tried this with a huge amount of data, it throws a MemoryError.

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(document).todense()   # MemoryError is raised here
transformer = vectorizer.transform(document).todense()
matrix_terms = np.array(vectorizer.get_feature_names())
lst_freq = map(sum, zip(*transformer.A))
matrix_freq = np.array(lst_freq)
final_matrix = np.array([matrix_terms, matrix_freq])

ERROR:

Traceback (most recent call last):
  File "demo1.py", line 13, in build_ngrams_matrix
    X = vectorizer.fit_transform(document).todense()
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 605, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
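
The dense conversion is the culprit: `todense()` materializes every zero in the document-term matrix. A rough back-of-the-envelope sketch of the scale involved (all sizes below are hypothetical, purely for illustration):

# Hypothetical sizes: with ngram_range=(1, 5), a few MB of text can easily
# produce millions of distinct n-grams.
n_docs = 10000         # hypothetical number of documents
n_ngrams = 5000000     # hypothetical 1- to 5-gram vocabulary size
bytes_per_cell = 8     # one int64 count per cell of the dense matrix

dense_gb = n_docs * n_ngrams * bytes_per_cell / 1e9
print("dense matrix would need ~%.0f GB" % dense_gb)  # ~400 GB, vs. MBs sparse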
  • Have you checked http://stackoverflow.com/questions/16332083/python-memoryerror-when-doing-fitting-with-scikit-learn or http://stackoverflow.com/questions/23879139/memory-error-at-python-while-converting-to-array? – fredtantini Nov 12 '14 at 13:24
  • I think `todense()` is what generates the `MemoryError`. But when I don't use `todense()`, the output is a sparse matrix, and I don't know how to read it. Any help please? – iNikkz Nov 12 '14 at 14:11
  • If you really want to look at the sparse matrix, you can look at a small chunk of it (e.g. the first 10 lines) like so `X[:10,:].todense()`. Most other operations, such as summation, work the same way for sparse and dense matrices, so you don't really need to call `todense/A/toarray` – mbatchkarov Nov 12 '14 at 15:51
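
Expanding on the comment above, a minimal sketch (reusing `document` and `vectorizer` from the question) of inspecting the sparse matrix without densifying all of it:

X = vectorizer.fit_transform(document)  # stays a scipy.sparse matrix
print(X[:10, :].todense())              # densify only a small slice for viewing
print(X.sum(axis=0))                    # summation works directly on the sparse matrix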

1 Answer


As the comments have mentioned, you're running into memory issues when you convert the large sparse matrices to dense format. Try something like this:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))

# Don't need both X and transformer; they should be identical
X = vectorizer.fit_transform(document)
matrix_terms = np.array(vectorizer.get_feature_names())

# Sum counts over documents (axis=0) directly on the sparse matrix;
# only the resulting 1 x n_terms row ever becomes dense
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])

EDIT: If you want a dictionary mapping each term to its frequency, try this after calling `fit_transform`:

terms = vectorizer.get_feature_names()
freqs = X.sum(axis=0).A1
result = dict(zip(terms, freqs))
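
As a quick sanity check, running the same steps on the two toy documents from the question (unigrams only; the default tokenizer drops one-letter words like "a") reproduces the frequencies given above. Note the values come back as numpy integers:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["john is a nice guy", "person can be a guy"]
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(docs)
result = dict(zip(vectorizer.get_feature_names(), X.sum(axis=0).A1))
print(result)
# -> {'be': 1, 'can': 1, 'guy': 2, 'is': 1, 'john': 1, 'nice': 1, 'person': 1}
# (key order may vary)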
  • Thanks @perimosocordiae. So you suggest using only `fit_transform()`, since both were identical. – iNikkz Nov 13 '14 at 07:08
  • @perimosocordiae: Need more help. Your `final_matrix` is a matrix, and I want to convert it into a `dictionary` quickly. I used `dict(zip(final_matrix[0], final_matrix[1]))`, but it takes seconds. Is there a faster way to convert the matrix into a dictionary? – iNikkz Nov 13 '14 at 07:21
  • I'm not sure if it will be much faster, but I've updated my answer to show how to make the dictionary you want. – perimosocordiae Nov 13 '14 at 16:15