
For a tutorial, I want to manually implement what the TfidfVectorizer does, just to show what's going on in the background. In this Stack Overflow article I found an explanation of how the TfidfVectorizer works. With that, it was straightforward to implement a naive version, and with the correct parameter settings for the vectorizer, the output is indeed the same. All good.
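To make that concrete, here is a minimal sketch of such a manual implementation; it assumes the vectorizer's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), and the example documents are made up:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]

# Raw term counts -- this is the "tf" that TfidfVectorizer works with
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed idf (the default)

tfidf = counts * idf                          # tf is the raw count here
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2 row norm

# Matches the library output with default parameters
print(np.allclose(tfidf, TfidfVectorizer().fit_transform(docs).toarray()))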

However, now I'm a bit confused: the TfidfVectorizer calculates the term frequency tf using the CountVectorizer. That means tf is just an integer representing the number of occurrences of a term in a document. But usually the term frequency tf(t, d) of a term t in a document d is defined as:

tf(t,d) = (#occurrences of t in d) / (#terms in d)

So the term frequency is a value between 0 and 1.
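As a quick illustration of that definition (the example document is made up):

# Textbook tf: relative frequency, always between 0 and 1
doc = "the cat sat on the mat".split()
tf = doc.count("the") / len(doc)   # 2 / 6 = 0.33...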

How does this fit together? Why does the TfidfVectorizer use the term count and not the (normalized) frequency from the definition above? I assume it's not a big deal, but I would like to understand it.


1 Answer


Usually, TfidfVectorizer is used like this:

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['string1', 'string2', 'string3', 'string4', 'string5']

tfidfve = TfidfVectorizer()
# fit_transform only needs the raw documents; the second positional
# argument (a list of feature names in the original) maps to the
# ignored y parameter and has no effect
X = tfidfve.fit_transform(data)
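
To inspect what came out (assuming scikit-learn 1.0+, where get_feature_names_out is available):

print(tfidfve.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                       # one row of tf-idf weights per document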