For a tutorial, I want to implement manually what the TfidfVectorizer is doing, just to show what's going on in the background. In this Stack Overflow article I found how the TfidfVectorizer works. With this, it was straightforward to implement it in a naive manner, and with the correct parameter settings for the vectorizer, the output is indeed the same. All good.
However, now I'm a bit confused: the TfidfVectorizer calculates the term frequency tf using the CountVectorizer. That means tf is just an integer representing the number of occurrences of a term in a document. But usually the term frequency tf(t,d) of term t in a document d is defined as:
tf(t,d) = (#occurrences of t in d) / (#terms in d)
So the term frequency is a value between 0 and 1.
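To make the difference concrete, here is a minimal pure-Python sketch of the two notions of tf. It does not use scikit-learn. The function names, the example sentence, and the lowercase-and-split-on-whitespace tokenization are my own assumptions for illustration, and the tokenization is simpler than what CountVectorizer does by default.

```python
from collections import Counter


def raw_counts(doc):
    """Raw term counts: the integer 'tf' described above."""
    # Assumption: lowercase and split on whitespace (hypothetical tokenizer).
    return Counter(doc.lower().split())


def normalized_tf(doc):
    """tf(t,d) = (#occurrences of t in d) / (#terms in d)."""
    counts = raw_counts(doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


doc = "the cat sat on the mat"  # made-up example document
counts = raw_counts(doc)
tf = normalized_tf(doc)

print(counts["the"])  # integer count of "the" in the document
print(tf["the"])      # the same count divided by the number of terms
```

Within one document the two versions differ only by the constant factor 1 / (#terms in d), so the normalized values of a document sum to 1 while the raw counts are integers.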
How does this fit together? Why is the TfidfVectorizer using the term count and not the (normalized) frequency according to the definition? I assume it's not a big deal, but I would like to understand it.