I am trying to compute tf-idf on a plain-text corpus of about 200k tokens. I first produced a count vectorizer, which gives the term frequencies. Then I produced a tf-idf matrix and got the following results. My code is:

from sklearn.feature_extraction.text import TfidfVectorizer

# raw string so the backslash in the Windows path is not read as an escape
with open(r"D:\history.txt", encoding='utf8') as infile:
    contents = infile.readlines()  # each line is treated as one document

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                 min_df=0.0,
                                 use_idf=True, ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)  # fit the vectorizer to contents

print(tfidf_matrix)

Results

  (0, 8371)     0.0296607326158
  (0, 27755)    0.159032195629
  (0, 59369)    0.0871403881289
   :    :
  (551, 64746)  0.0324104689629
  (551, 10118)  0.0324104689629
  (551, 9308)   0.0324104689629

However, I want to get the results in the following form, with the term itself instead of its column index:

   (551, good ) 0.0324104689629
James Z
user103987
  • Well, `TfidfVectorizer` just gives you a sparse matrix. It's up to you what to do with it afterwards. By `(551, good )` do you mean that you would like to do classification? Then see http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html – rth Apr 20 '17 at 12:20
  • This question is very unclear. What do you mean by "good"? The `(551, 9308)` represents the index of element `(row_num, col_num)`. What do you want to do? – Vivek Kumar Apr 20 '17 at 12:25
  • I want to show the tf-idf values together with the document words – user103987 Apr 20 '17 at 12:37
  • I am sorry, but I do not want classification at this time. I just want to show the results with words, like good 0.9887766 love 0.56744, whereas in the matrix shown above each term appears as a numeric code. – user103987 Apr 20 '17 at 12:38
  • Thank you, I got your point and solved my problem. How can I plot a graph of that data? Could you help me? – user103987 Apr 20 '17 at 12:52
  • Please post that as a new question – Vivek Kumar Apr 20 '17 at 13:17

1 Answer

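You can combine the indices of the nonzero entries in the sparse output tfidf_matrix with the vectorizer's feature names to produce the output you want. In scikit-learn 1.0 and later the method is TfidfVectorizer.get_feature_names_out(); older releases used get_feature_names(), which was removed in 1.2:

features = tfidf_vectorizer.get_feature_names_out()
indices = zip(*tfidf_matrix.nonzero())
for row, column in indices:
    # look up the term for this column and print it with its tf-idf weight
    print('(%d, %s) %f' % (row, features[column], tfidf_matrix[row, column]))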
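
If you only need the highest-weighted terms per document (the "good 0.98 love 0.56" style asked for in the comments), one possible approach is to read each row straight from the CSR arrays and sort it. This is only a sketch: the toy corpus `docs`, the helper `top_terms` and the name `index_to_term` are made up for illustration, and it assumes `fit_transform` returns a matrix that can be converted with `tocsr()`. It looks terms up through `vocabulary_` (a term-to-column mapping), so it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the lines of history.txt.
docs = [
    "good food and good service",
    "love the food",
    "service was slow",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs).tocsr()

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}


def top_terms(matrix, index_to_term, row, n=3):
    """Return the n highest-weighted (term, score) pairs for one document."""
    # In CSR format, indptr delimits the slice of indices/data owned by each row.
    start, end = matrix.indptr[row], matrix.indptr[row + 1]
    cols = matrix.indices[start:end]
    vals = matrix.data[start:end]
    pairs = [(index_to_term[int(c)], float(v)) for c, v in zip(cols, vals)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]


for row in range(matrix.shape[0]):
    for term, score in top_terms(matrix, index_to_term, row):
        print("(%d, %s) %f" % (row, term, score))
```

Slicing by `indptr` avoids one `tfidf_matrix[row, column]` lookup per nonzero entry, which can be noticeably slower on a large matrix.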
Vivek Kumar