Given the following code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import urllib.request  # the lib that handles the url stuff
from bs4 import BeautifulSoup
import unicodedata

def remove_control_characters(s):
    base = ""
    for ch in s:
        if unicodedata.category(ch)[0]!="C":
            base = base + ch.lower()
        else:
            base = base + " "
    return base 

moby_dick_url='http://www.gutenberg.org/files/2701/2701-0.txt'

soul_of_japan = 'http://www.gutenberg.org/files/12096/12096-0.txt'

def extract_body(url):
    with urllib.request.urlopen(url) as s:
        data = BeautifulSoup(s).body()[0].string
        stripped = remove_control_characters(data)
        return stripped

moby = extract_body(moby_dick_url)    
bushido = extract_body(soul_of_japan)

corpus = [moby,bushido]

vectorizer = TfidfVectorizer(use_idf=False, smooth_idf=True)
tf_idf = vectorizer.fit_transform(corpus)
df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names(), index=["Moby", "Bushido"])
df_tfidf[["the", "whale"]]
```

I would expect "whale" to be given a relatively high tf-idf score in "Moby Dick" but a low one in "Bushido: The Soul of Japan", and "the" to be given a low score in both. However, I get the opposite. The calculated results are:

|       |     the   | whale    |
|-------|-----------|----------|
|Moby   | 0.707171  | 0.083146 |
|Bushido| 0.650069  | 0.000000 |

This makes no sense to me. Can anyone point out the mistake I have made, either in my thinking or in my code?

James Hamilton

1 Answer

There are two reasons why you are observing this.

  • The first is because of the parameters you passed to your `TfidfVectorizer`. You should be using `TfidfVectorizer(use_idf=True, ...)`, because it is the idf part of tf-idf (remember that tf-idf is the product of term frequency and inverse document frequency) that penalizes words appearing in all documents. By setting `TfidfVectorizer(use_idf=False, ...)`, you are only considering the term-frequency part, which naturally gives stopwords a larger score (see the sketch after this list).

  • The second is because of your data. Even if you fix the parameter problem above, your corpus is still very small: just two documents. This means that any word appearing in both books is penalized in exactly the same way. "courage" might appear in both books, just as "the" does, and since both then occur in every document of your corpus, their idf values are identical, so stopwords again end up with a larger score because of their larger term frequency.
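
A minimal sketch of the fix, keeping the rest of your pipeline as-is (the two-document `corpus` and the `get_feature_names()` call from your question are assumed):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=True (the default) re-enables the inverse-document-frequency weighting
vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
tf_idf = vectorizer.fit_transform(corpus)

df_tfidf = pd.DataFrame(
    tf_idf.toarray(),
    columns=vectorizer.get_feature_names(),
    index=["Moby", "Bushido"],
)

# "whale" occurs only in Moby Dick, so it keeps a comparatively high score there
# and zero for Bushido. "the" occurs in both documents, so with only two documents
# its idf equals that of every other shared word, and its large term frequency
# still dominates.
print(df_tfidf[["the", "whale"]])

# The idf values themselves: with two documents there are only two possible values,
# one for terms appearing in a single book and one for terms appearing in both.
idf = pd.Series(vectorizer.idf_, index=vectorizer.get_feature_names())
print(idf[["the", "whale"]])
```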

MaximeKan
  • Thank you! I'll accept your answer. I didn't actually notice the issue with `use_idf=False`, that was a reflection of my experimenting to see if the setting made any difference. Which it didn't, for the second reason you mention above. – James Hamilton Jan 22 '20 at 09:18
  • I presume that the Scikit implementation uses an idf formulation like `log(N/(nt+1))` since a word appearing in all documents would render `log(N/N) = 0` if they had used Sparck-Jones's original formulation. – James Hamilton Jan 22 '20 at 09:36
  • @JamesHamilton, yes indeed! – MaximeKan Jan 22 '20 at 21:15
  • This [related answer](https://stackoverflow.com/a/70733642/17865804) addresses the same problem, providing an interesting comparison of results between sklearn's `TfidfVectorizer` and the standard Tf-idf formula. – Chris Jan 10 '23 at 18:04
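
Following up on the idf formulation discussed in the comments above: scikit-learn's documented (smoothed) idf is `ln((1 + N) / (1 + df(t))) + 1`, so a term occurring in every document gets an idf of 1 rather than 0. A small sketch contrasting it with the classic Spärck Jones formulation (the helper function names are just for illustration):

```python
import numpy as np

def sklearn_idf(n_docs, doc_freq, smooth_idf=True):
    # Formulation used by scikit-learn's TfidfVectorizer / TfidfTransformer
    if smooth_idf:
        return np.log((1 + n_docs) / (1 + doc_freq)) + 1
    return np.log(n_docs / doc_freq) + 1

def classic_idf(n_docs, doc_freq):
    # Textbook idf: log(N / df); zero for a term present in every document
    return np.log(n_docs / doc_freq)

# "the" appears in both of the two documents:
print(sklearn_idf(2, 2))  # 1.0 -> the term is kept, just not boosted
print(classic_idf(2, 2))  # 0.0 -> the term would be wiped out entirely
```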