I have a corpus of 5000 reviews stored in sample_data. I am applying a tf-idf vectorizer to sample_data using unigrams only (ngram_range=(1,1)). Now I want to get the top 1000 words from sample_data, meaning the ones with the highest tf-idf values. Could anyone tell me how to get these top words?

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)
merkle

1 Answer

TF-IDF values depend on individual documents. You can get the top 1000 terms based on their count (tf) across the corpus by using the max_features parameter of TfidfVectorizer:

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.

Just do:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)

You can also get the idf values (global term weights) from tf_idf_vect after fitting it on the documents, using the idf_ attribute:

idf_ : array, shape = [n_features], or None

  The learned idf vector (global term weights) when use_idf is set to True, None otherwise.

Do this after calling tf_idf_vect.fit(sample_data):

idf = tf_idf_vect.idf_

Then select the 1000 terms with the highest idf values and re-fit the vectorizer on the data, restricted to those selected terms.

But you cannot get a single top 1000 by "tf-idf", because the tf-idf of a term is the product of its tf in one particular document and its (global) idf. A word that appears twice in one document will therefore have twice the tf-idf (before normalization) of the same word in another document where it appears only once. Which of those different values of the same term would you use for the comparison? Hope this makes it clear.
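
A small sketch of this point, using two made-up documents. It passes norm=None so that the raw tf * idf product is visible; with the default norm='l2' each row is rescaled, so the ratio would no longer be exactly two, but the values would still differ per document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents: "good" appears twice in the first, once in the second.
docs = ["good good food", "good service"]

# norm=None disables row normalization so each entry is plain tf * idf.
vect = TfidfVectorizer(norm=None)
X = vect.fit_transform(docs)

col = vect.vocabulary_["good"]  # column index of the term "good"
score_doc0 = X[0, col]
score_doc1 = X[1, col]

# Same term, same idf, but a different tf-idf value in each document.
print(score_doc0, score_doc1)
```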

Vivek Kumar
  • Presumably OP wants to treat the 5000 reviews as separate documents, thus the 'document' they mention is really a corpus? And in that case TFIDF is well-defined. – smci Aug 02 '18 at 17:42
  • @smci Sorry I do not understand. If the OP wants to treat each review as separate documents, so does he want top 1000 terms from each review separately? – Vivek Kumar Aug 03 '18 at 02:03