
Hi, I'm trying to understand how scikit-learn computes the TF-IDF score in the matrix below, e.g. for document 1, feature 6, "wine":

from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

# Create vectorizer
vec = TfidfVectorizer(stop_words='english')
# Feature vector
tfidf = vec.fit_transform(test_doc)

feature_names = vec.get_feature_names()
feature_matrix = tfidf.todense()

['bottle', 'choice', 'delightful', 'lovely', 'red', 'terrible', 'wine']
[[ 0.         0.         0.         0.78528828 0.        0.         0.6191303 ]
 [ 0.         0.         0.78528828 0.         0.6191303 0.         0.        ]
 [ 0.         0.61761437 0.         0.         0.        0.61761437 0.48693426]
 [ 0.78528828 0.         0.         0.         0.6191303 0.         0.        ]]

I was using the answer to a very similar question to calculate it for myself: How is TF-IDF calculated by the scikit-learn TfidfVectorizer? However, in that answer the TfidfVectorizer uses norm=None.

As I'm using the default setting of norm='l2', how does this differ from norm=None, and how can I calculate it for myself?
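For reference, here is what I get when I run the same vectorizer with norm=None (as in the linked answer), so the values below are the raw, unnormalized tf-idf scores:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

# Same vectorizer but without l2 normalization
vec_raw = TfidfVectorizer(stop_words='english', norm=None)
raw_matrix = vec_raw.fit_transform(test_doc).todense()
print(raw_matrix)
# Document 1, "wine" (column 6) comes out as ~1.5108 here instead of 0.6191303
```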

ChatNoir
    Maybe other answers [here](https://stackoverflow.com/questions/42440621/how-term-frequency-is-calculated-in-tfidfvectorizer/42451555#42451555) and [here](https://stackoverflow.com/questions/43091235/tfidf-transform-function-not-returning-correct-values/43092569#43092569) help you – Vivek Kumar Apr 14 '18 at 02:05
  • @VivekKumar this is really helpful, thank you! – ChatNoir Apr 14 '18 at 10:13
  • @VivekKumar, I'm trying to reproduce the answer to your calculation: log(2/1)+1 but I'm getting 1.301 not 0.693... What am I doing wrong here? – ChatNoir Apr 14 '18 at 11:31
  • 1
    Check the base in your calculations. It should be ln(2/1) (with base 2). I think you are using log(2/1) with base 10. See [here](https://math.stackexchange.com/questions/90594/the-difference-between-log-and-ln) – Vivek Kumar Apr 14 '18 at 12:20
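A quick check of the two bases in Python (scikit-learn and `math.log` both use the natural log, base e):

```python
import math

# math.log uses base e (natural log); math.log10 uses base 10
print(math.log(2 / 1))    # ~0.6931 -> the 0.693 in the answer's calculation
print(math.log10(2 / 1))  # ~0.3010 -> +1 gives the 1.301 above
```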

1 Answer


After some computation:

The TF-IDF vectors of the documents (tfidf.todense()) are computed from the formula:

  • TFIDF = tf(t,d) * idf(t,D)

  • tf(t,d) = the number of times word t appears in document d (not divided by the total number of words in the document)

  • idf(t,D) = ln( (1 + D) / (1 + df) ) + 1

    • D = number of documents
    • df = number of documents that contain word t: looking through all documents, we add +1 to df for each document where word t occurs (it doesn't matter how many times it occurs within that document)
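A minimal check of this idf formula against scikit-learn's fitted idf_ attribute, using the four documents from the question (note the extra +1, which scikit-learn's default smooth_idf=True adds):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

vec = TfidfVectorizer(stop_words='english')
vec.fit(test_doc)

D = 4   # number of documents
df = 2  # "wine" occurs in documents 0 and 2
idf_wine = np.log((1 + D) / (1 + df)) + 1

# "wine" is the last feature (index 6) in alphabetical order
print(idf_wine, vec.idf_[6])  # both ~1.5108
```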

Parameter norm

  • norm = None : compute the TF-IDF vectors with the formula above

  • norm = 'l2' : compute the TF-IDF vectors with the formula above, then divide each TF-IDF vector by its length, so the new vectors are computed as

  • TFIDF_vector_i = TFIDF_vector_i / length_TFIDF_vector_i

  • length_TFIDF_vector_i = Sqrt(Sum(tfidf[i]^2)) — the vector's Euclidean length
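Putting both steps together for document 1 ("The wine was lovely"), a numpy sketch that reproduces the 0.6191303 from the question:

```python
import numpy as np

D = 4  # total number of documents
# After stop-word removal, document 1 contains "lovely" (df=1) and "wine" (df=2),
# each appearing once, so tf = 1 for both
idf_lovely = np.log((1 + D) / (1 + 1)) + 1   # df("lovely") = 1
idf_wine   = np.log((1 + D) / (1 + 2)) + 1   # df("wine")   = 2

raw = np.array([idf_lovely, idf_wine])       # tf * idf with tf = 1

# l2 normalization: divide by the vector's Euclidean length
length = np.sqrt((raw ** 2).sum())
print(raw / length)  # ~[0.78528828, 0.6191303]
```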

thedemons