
Hi, I'm trying to understand how scikit-learn computes the TF-IDF score in the matrix below, e.g. for document 1, feature 6, "wine":

from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

# Create vectorizer
vec = TfidfVectorizer(stop_words='english')
# Feature vector
tfidf = vec.fit_transform(test_doc)

feature_names = vec.get_feature_names()
feature_matrix = tfidf.todense()

['bottle', 'choice', 'delightful', 'lovely', 'red', 'terrible', 'wine']
[[ 0.         0.         0.         0.78528828 0.        0.         0.6191303 ]
 [ 0.         0.         0.78528828 0.         0.6191303 0.         0.        ]
 [ 0.         0.61761437 0.         0.         0.        0.61761437 0.48693426]
 [ 0.78528828 0.         0.         0.         0.6191303 0.         0.        ]]

I was using the answer to a very similar question to calculate it for myself: How is TF-IDF calculated by the scikit-learn TfidfVectorizer? However, in that answer the TfidfVectorizer uses norm=None.

As I'm using the default setting of norm='l2', how does this differ from norm=None, and how can I calculate it for myself?
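For reference, here is what I get when I run the same vectorizer with norm=None (as in the linked answer), so the values below are the raw, unnormalized tf-idf scores:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

# Same vectorizer but without l2 normalization
vec_raw = TfidfVectorizer(stop_words='english', norm=None)
raw_matrix = vec_raw.fit_transform(test_doc).todense()
print(raw_matrix)
# Document 1, "wine" (column 6) comes out as ~1.5108 here instead of 0.6191303
```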

ChatNoir
    Maybe other answers [here](https://stackoverflow.com/questions/42440621/how-term-frequency-is-calculated-in-tfidfvectorizer/42451555#42451555) and [here](https://stackoverflow.com/questions/43091235/tfidf-transform-function-not-returning-correct-values/43092569#43092569) help you – Vivek Kumar Apr 14 '18 at 02:05
  • @VivekKumar this is really helpful, thank you! – ChatNoir Apr 14 '18 at 10:13
  • @VivekKumar, I'm trying to reproduce the answer to your calculation: log(2/1)+1 but I'm getting 1.301 not 0.693... What am I doing wrong here? – ChatNoir Apr 14 '18 at 11:31
  • 1
    Check the base in your calculations. It should be ln(2/1) (with base 2). I think you are using log(2/1) with base 10. See [here](https://math.stackexchange.com/questions/90594/the-difference-between-log-and-ln) – Vivek Kumar Apr 14 '18 at 12:20
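A quick check of the two bases in Python (scikit-learn and `math.log` both use the natural log, base e):

```python
import math

# math.log uses base e (natural log); math.log10 uses base 10
print(math.log(2 / 1))    # ~0.6931 -> the 0.693 in the answer's calculation
print(math.log10(2 / 1))  # ~0.3010 -> +1 gives the 1.301 above
```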

1 Answer


After some computation:

The TF-IDF vectors of the documents (tfidf.todense()) are computed from the formula:

  • TFIDF = tf(t,d) * idf(t,D)

  • tf(t,d) = the number of times word t appears in document d (not divided by the total number of words in the document)

  • idf(t,D) = ln( (1 + D) / (1 + df) ) + 1

    • D = number of documents
    • df = number of documents that contain word t: looking through all documents, we add +1 to df for each document where word t occurs (it doesn't matter how many times it occurs within that document)
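A minimal check of this idf formula against scikit-learn's fitted idf_ attribute, using the four documents from the question (note the extra +1, which scikit-learn's default smooth_idf=True adds):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

test_doc = ['The wine was lovely', 'The red was delightful',
            'Terrible choice of wine', 'We had a bottle of red']

vec = TfidfVectorizer(stop_words='english')
vec.fit(test_doc)

D = 4   # number of documents
df = 2  # "wine" occurs in documents 0 and 2
idf_wine = np.log((1 + D) / (1 + df)) + 1

# "wine" is the last feature (index 6) in alphabetical order
print(idf_wine, vec.idf_[6])  # both ~1.5108
```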

Parameter norm

  • norm = None : compute the TF-IDF vectors with the formula above

  • norm = 'l2' : compute the TF-IDF vectors with the formula above, then divide each TF-IDF vector by its length, so the new vectors are computed as

  • TFIDF_vector_i = TFIDF_vector_i / length_TFIDF_vector_i

  • length_TFIDF_vector_i = Sqrt(Sum(tfidf[i]^2)) — the vector's Euclidean length
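Putting both steps together for document 1 ("The wine was lovely"), a numpy sketch that reproduces the 0.6191303 from the question:

```python
import numpy as np

D = 4  # total number of documents
# After stop-word removal, document 1 contains "lovely" (df=1) and "wine" (df=2),
# each appearing once, so tf = 1 for both
idf_lovely = np.log((1 + D) / (1 + 1)) + 1   # df("lovely") = 1
idf_wine   = np.log((1 + D) / (1 + 2)) + 1   # df("wine")   = 2

raw = np.array([idf_lovely, idf_wine])       # tf * idf with tf = 1

# l2 normalization: divide by the vector's Euclidean length
length = np.sqrt((raw ** 2).sum())
print(raw / length)  # ~[0.78528828, 0.6191303]
```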

thedemons