
I have two documents, doc1.txt and doc2.txt, with the following contents:

 #doc1.txt
 very good, very bad, you are great

 #doc2.txt
 very bad, good restaurent, nice place to visit

I want to split each document on commas, so that every comma-separated phrase becomes one term and my final DocumentTermMatrix looks like this:

      terms
 docs       very good      very bad        you are great   good restaurent   nice place to visit
 doc1       tf-idf          tf-idf         tf-idf          0                    0
 doc2       0               tf-idf         0               tf-idf             tf-idf

I know how to calculate a DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to calculate one for multi-word strings in Python.

YS-L
user2481422

1 Answer


You can specify the analyzer argument of TfidfVectorizer as a function which extracts the features in a customized way:

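from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# Treat each comma-separated phrase as one feature instead of tokenizing word by word.
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
# On scikit-learn older than 1.0, call get_feature_names() instead.
print(tfidf.get_feature_names_out().tolist())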

The resulting features are:

['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']

If you really cannot afford to load all the data into memory, this is a workaround:

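from sklearn.feature_extraction.text import TfidfVectorizer

# Here each "document" is a filename; the analyzer opens and reads the file itself.
docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    with open(filename) as f:
        features = []
        for line in f:
            # Each comma-separated phrase on the line becomes one feature.
            features += line.strip().split(', ')
        return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
# On scikit-learn older than 1.0, call get_feature_names() instead.
print(tfidf.get_feature_names_out().tolist())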

which loads the documents one at a time without holding all of them in memory at once.

YS-L
  • I have many documents, so I need to load my `txt` files and cannot create a list manually. – user2481422 Jun 10 '14 at 08:22
  • Can you load the contents of the files beforehand into a list? If not, please see the edit. – YS-L Jun 10 '14 at 08:35
  • Here, it is treating each `string` in `tfidf.get_feature_names()` as a single document. I want only 2 documents and 5 texts, as shown in my question. – user2481422 Jun 10 '14 at 11:35
  • @user2481422 By "5 texts", aren't you referring to the five columns of the document-term matrix? The `get_feature_names()` function returns the "texts" corresponding to each column, i.e. the features. To get the 2 x 5 sparse matrix, simply use `tfidf.transform(docs)`. – YS-L Jun 10 '14 at 15:14
  • @user2481422 Actually, the two documents are used here to fit the transformer. Note that the `X` passed to the `fit` function is exactly `docs`, which contains two documents. It just uses a customized way to extract features (normally word by word, but not here), which results in those five features. I believe this is actually what you want; if not, please let me know. – YS-L Jun 12 '14 at 06:41
  • I have shown my desired result in the question. I want `2 documents - doc1 and doc2` and `terms as - very good very bad you are great good restaurent nice place to visit`- as shown in my desired output. – user2481422 Jun 17 '14 at 15:32