tf-idf results analysis with python

Question

I am trying to compute tf-idf on a plain-text corpus of about 200k tokens. I first computed the term frequencies with a count vectorizer, then produced the tf-idf matrix and got the results below. My code is:

from sklearn.feature_extraction.text import TfidfVectorizer
with open(r"D:\history.txt", encoding='utf8') as infile:
    contents = infile.readlines()
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                 min_df=0.0,
                                 use_idf=True, ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to contents

print(tfidf_matrix)

Results

  (0, 8371)     0.0296607326158
  (0, 27755)    0.159032195629
  (0, 59369)    0.0871403881289
   :    :
  (551, 64746)  0.0324104689629
  (551, 10118)  0.0324104689629
  (551, 9308)   0.0324104689629

However, I want the results in the following form:

   (551, good ) 0.0324104689629

Well, `TfidfVectorizer` just gives you a sparse matrix. It's up to you what to do with it afterwards. By `(551, good )` do you mean that you would like to do classification? Then see http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html – rth Apr 20 '17 at 12:20
This question is very unclear. What do you mean by "good"? `(551, 9308)` is the index of an element, given as `(row_num, col_num)`. What do you want to do? – Vivek Kumar Apr 20 '17 at 12:25
I am sorry, but I do not want classification at this time. I just want to show the results with words, like good 0.9887766, love 0.56744, whereas the matrix shown above identifies each term by a numeric code. – user103987 Apr 20 '17 at 12:38
Thank you, I got your point and solved my problem. How can I get a graph of that data? Could you help me? – user103987 Apr 20 '17 at 12:52

score 0 · Answer 1 · answered Apr 20 '17 at 13:17

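You can combine the indices of the nonzero entries in the sparse output `tfidf_matrix` with the terms returned by `TfidfVectorizer.get_feature_names_out()` to produce the output you want. (On scikit-learn versions older than 1.0 the method is called `get_feature_names()`; that older name was removed in 1.2.)

features = tfidf_vectorizer.get_feature_names_out()
# nonzero() returns (row_indices, column_indices); zip pairs them up
indices = zip(*tfidf_matrix.nonzero())
for row, column in indices:
    print('(%d, %s) %f' % (row, features[column], tfidf_matrix[row, column]))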
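
For a self-contained illustration of the same index-to-term mapping, here is a minimal sketch on a made-up three-sentence corpus. The `docs` list and the variable names are assumptions for illustration, not taken from the question. It reads the row, column, and value arrays from the matrix's COO form instead of indexing one element at a time, which is likely to be cheaper on a large matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus standing in for the lines of history.txt.
docs = [
    "good food and good company",
    "love good music",
    "history repeats itself",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# older releases only provide get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    features = vectorizer.get_feature_names_out()
else:
    features = vectorizer.get_feature_names()

# The COO form exposes parallel arrays: row indices, column indices, values.
coo = matrix.tocoo()
triples = [
    (int(r), str(features[c]), float(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

# Print each nonzero entry as (document index, term) score.
for row, term, score in triples:
    print('(%d, %s) %f' % (row, term, score))
```

Collecting the entries in a list such as `triples` also makes it easy to sort them by score or pass them on to other tools.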

Vivek Kumar

Could you help me generate a graph / plot of this data? – user103987 Apr 23 '17 at 16:37
I did not understand. Which type of graph do you want to generate? – Vivek Kumar Apr 24 '17 at 01:22
I want to present the tf-idf results as a 2D plot. The results are shown above in my question. – user103987 Apr 24 '17 at 09:53
@user103987 And what do the two dimensions represent? – Vivek Kumar Apr 24 '17 at 09:56
