
I'm trying to get words that are distinctive of certain documents using the TfidfVectorizer class in scikit-learn. It creates a tf-idf matrix with all the words and their scores in all the documents, but it seems to give common words high scores as well. This is some of the code I'm running:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# `contents` is a list of document strings; `characters` holds the matching labels
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = pd.Series(df.loc['Adam'])
s[s > 0].sort_values(ascending=False)[:10]

I expected this to return a list of distinctive words for the document 'Adam', but instead it returns a list of common words:

and     0.497077
to      0.387147
the     0.316648
of      0.298724
in      0.186404
with    0.144583
his     0.140998

I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus: words that appear frequently in one document, but not in the other documents. Here, 'and' appears frequently in the other documents as well, so I don't know why it's getting a high value here.
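
To make my expectation concrete, here's the rough textbook arithmetic I have in mind (a toy sketch with made-up documents, not necessarily the exact weighting scikit-learn applies):

import math

# Toy corpus: 'and' appears in every document, 'rib' in only one.
docs = [
    "and the rib bone of adam".split(),
    "and the garden of eden".split(),
    "and the serpent spoke to eve".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)       # term frequency in this document
    df = sum(term in d for d in docs)     # number of documents containing the term
    return tf * math.log(len(docs) / df)  # textbook tf * idf

print(tfidf("and", docs[0], docs))  # 0.0  -- 'and' occurs in every document
print(tfidf("rib", docs[0], docs))  # ~0.18 -- 'rib' is distinctive of this document

So I expected words like 'and' to be pushed toward zero, not to the top of the list.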

The complete code I'm using to generate this is in this Jupyter notebook.

When I compute tf-idfs semi-manually, using NLTK and computing scores for each word (sketched below), I get the appropriate results. For the 'Adam' document:

fresh        0.000813
prime        0.000813
bone         0.000677
relate       0.000677
blame        0.000677
enough       0.000677

That looks about right, since these are words that appear in the 'Adam' document, but not as much in other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.
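
Roughly, that semi-manual computation looks like this (a simplified sketch; the variable `documents`, a dict mapping character names to their raw text, stands in for what the notebook actually builds):

import math
from nltk.tokenize import word_tokenize  # requires the punkt data: nltk.download('punkt')

# `documents` maps each character name to that character's text (hypothetical here).
tokenized = {name: word_tokenize(text.lower()) for name, text in documents.items()}

def tf(term, tokens):
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    df = sum(term in tokens for tokens in tokenized_docs.values())
    return math.log(len(tokenized_docs) / df)

adam = tokenized['Adam']
scores = {term: tf(term, adam) * idf(term, tokenized) for term in set(adam)}
sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]

With this weighting, a word that occurs in every document gets an idf of zero, which is why the common words drop out of the top of the list.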

Am I doing something wrong with the scikit code? Is there another way to initialize this class where it returns the right results? Of course, I can ignore stopwords by passing stop_words = 'english', but that doesn't really solve the problem, since common words of any sort shouldn't have high scores here.

Jonathan

5 Answers


From the scikit-learn documentation:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

As you can see, TfidfVectorizer is a CountVectorizer followed by TfidfTransformer.

What you are probably looking for is TfidfTransformer, not TfidfVectorizer.
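
For reference, here is a sketch of the two-step pipeline, reusing the asker's contents list (note that with default settings it should produce the same weights as TfidfVectorizer):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Two steps: raw term counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(contents)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer bundles both of the above.
tfidf_one_step = TfidfVectorizer().fit_transform(contents)

# With default parameters the two matrices should agree (up to floating point).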

Sagar Waghmode
  • TfidfTransformer will transform the output of CountVectorizer, so I can run CountVectorizer and then run TfidfTransformer, but that's the same as running TfidfVectorizer. So I'm not convinced I need TfidfTransformer, if I'm going to have to run CountVectorizer first anyway. Won't it return the same results? – Jonathan Apr 22 '16 at 20:49

I believe your issue lies in using different stopword lists. Scikit-learn and NLTK use different stopword lists by default. For scikit-learn it is usually a good idea to have a custom stop_words list passed to TfidfVectorizer, e.g.:

my_stopword_list = ['and','to','the','of']
my_vectorizer = TfidfVectorizer(stop_words=my_stopword_list)

Doc page for the TfidfVectorizer class: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Rabbit
  • That's good to know, but I guess I'm confused about why one needs to remove stopwords to begin with. If 'and' or 'the' occurs frequently in all documents, let's say, then why would it have a high tf-idf value? It seems to me that the point of tf-idf is to adjust for the term's frequency across all documents, so that terms that occur frequently across the corpus won't appear at the top of the list. – Jonathan Apr 23 '16 at 20:04
  • @Jono, I guess your intuition is that TFIDF should benefit rare terms. This is half true. TFIDF takes into account two main things: TF, which is the term frequency in the document, and IDF, which is the inverse document frequency over the whole set of documents. TF benefits frequent terms, while IDF benefits rare terms. These two are almost opposing measures, which makes the TFIDF a balanced metric (see the sketch after these comments). – Rabbit Apr 23 '16 at 20:12
  • Also, stopword removal is a very common practice when using a vector-space representation. We can reason this way: for most applications, you want a metric that is high for important terms and low/zero for unimportant ones. If your representation (TFIDF in this case) fails to do that, you counter this by removing a term that will not help and may even hurt your model. – Rabbit Apr 23 '16 at 20:19
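
To make the TF-versus-IDF point from the comments concrete, here is a small sketch using scikit-learn's documented default idf (smooth_idf=True), under which a term found in every document still gets an idf of 1 rather than 0 (illustrative numbers, not the asker's exact corpus):

import math

n_docs = 8   # illustrative corpus size
df_and = 8   # 'and' appears in every document
df_rib = 1   # 'rib' appears in a single document

# scikit-learn default: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_and = math.log((1 + n_docs) / (1 + df_and)) + 1   # = 1.0
idf_rib = math.log((1 + n_docs) / (1 + df_rib)) + 1   # ~ 2.5

# Because the default tf is the raw count, a word used hundreds of times per
# document can outweigh that modest idf gap, e.g. 300 * 1.0 > 5 * 2.5.

That is why, without stopword removal or sublinear tf scaling, very frequent words can still come out on top.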

Using the code below, I get much better results:

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')

Output

sustain    0.045090
bone       0.045090
thou       0.044417
thee       0.043673
timely     0.043269
thy        0.042731
prime      0.041628
absence    0.041234
rib        0.041234
feel       0.040259
Name: Adam, dtype: float64

and for 'Eve':

thee          0.071188
thy           0.070549
forbids       0.069358
thou          0.068068
early         0.064642
earliest      0.062229
dreamed       0.062229
firmness      0.062229
glistering    0.062229
sweet         0.060770
Name: Eve, dtype: float64
realmq

I'm not sure why it's not the default, but you probably want sublinear_tf=True in the initialization for TfidfVectorizer. I forked your repo and sent you a PR with an example that probably looks more like what you want.
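
A quick sketch of what that flag changes: with sublinear_tf=True, scikit-learn replaces the raw term count tf with 1 + log(tf), which flattens the advantage of words repeated hundreds of times:

import math

# Sublinear tf scaling: 1 + ln(tf) instead of the raw count tf.
for tf in (1, 5, 50, 500):
    print(tf, "->", round(1 + math.log(tf), 2))
# 1 -> 1.0, 5 -> 2.61, 50 -> 4.91, 500 -> 7.21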

Randy
  • Awesome. That's a big improvement. But if you run it with a smaller set of characters, instead of all the characters, you get lists of commonly-used words again: https://github.com/JonathanReeve/milton-analysis/blob/v0.2/tfidf-scikit.ipynb "And," "to," "the," and "of" are the words with the highest tf-idfs for Adam and Eve, but those are words that appear frequently across the corpus, so I don't know why they're getting high tf-idf scores here. – Jonathan Apr 23 '16 at 21:17
  • Because you are now using far fewer documents. The IDF is based on the number of documents a term is found in (i.e., it's a *per-document count*), so it doesn't get very large with just four documents (df ≤ 4 for any term) and you don't have enough "statistical power". – fnl Apr 25 '16 at 08:07
  • @Jono, how come I get a different result by running the same code? The only difference is "vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')", and then I seem to get much more reasonable output for Adam: sustain 0.045090, bone 0.045090, thou 0.044417, thee 0.043673, timely 0.043269, thy 0.042731, prime 0.041628, absence 0.041234, rib 0.041234, feel 0.040259 – realmq Jul 24 '18 at 01:00

The answer to your question may lie in the size of your corpus and in the source code of the different implementations. I haven't looked into the NLTK code in detail, but 3-8 documents (as in the scikit-learn code) is probably not big enough to constitute a corpus. Corpora are usually built from news archives with hundreds of thousands of articles, or from thousands of books. Maybe the frequency of words like 'the' across your 8 documents was not large enough to reflect how common these words are among those documents.
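
A sketch of how corpus size limits the idf spread, using scikit-learn's default smoothed idf (the numbers are illustrative, not the asker's actual corpus):

import math

def smoothed_idf(n_docs, doc_freq):
    # scikit-learn default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Compare the rarest possible term (df = 1) with a term found in every document.
for n in (4, 40000):
    print(n, round(smoothed_idf(n, 1), 2), round(smoothed_idf(n, n), 2))
# 4 documents:      1.92 vs 1.0  -- little room to separate rare from common words
# 40,000 documents: 10.9 vs 1.0  -- rare words get a much larger boost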

If you look at the source code, you might be able to find differences in implementation, e.g. whether they follow different normalization steps or weighting schemes (https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html lists common tf-idf variants).

Another thing that may help is looking at the raw term frequencies (CountVectorizer in scikit-learn) and checking whether words like 'the' are indeed over-represented in all documents.
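
A rough sketch of that check, reusing the asker's contents and characters lists:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(contents)
count_df = pd.DataFrame(counts.toarray(),
                        columns=cv.get_feature_names(),  # get_feature_names_out() in newer versions
                        index=characters)

# Raw counts of 'the' per document, and the number of documents containing it.
print(count_df['the'])
print((count_df['the'] > 0).sum(), "of", len(count_df), "documents contain 'the'")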

user2827262