
How are TF-IDF values calculated by scikit-learn in Python, and how can I reproduce the result below?


Document 1 : ['includ', 'name', 'function', 'type', 'argument']

Document 2 : ['name', 'function', 'type', 'argument']


I run the following code to calculate TF-IDF for the terms in both Doc 1 and Doc 2:

Python

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=processData, stop_words='english')
tfs = tfidf.fit_transform(rawContentDict.values())
tfs_Values = tfs.toarray()
tfs_Term = tfidf.get_feature_names()

I get the following output of tf-idf values :

Document 1 : [includ = 0.630099, name = 0.448320, function = 0.448320 , type = 0.448320, argument = 0.448320]

Document 2 : [includ = 0 , name= 0.577350 , function = 0.577350 , type= 0.577350, argument= 0.577350]

Now I don't understand how these scores are computed. I tried to work them out by hand, but I got different results from the program output. How does scikit-learn calculate the TF-IDF score, and how can I reproduce the result above? Your help is much appreciated.

What I have tried:

I read the helpful articles [1] and [2] and implemented the steps they describe, but I still don't get the same results.

[1] https://towardsdatascience.com/measure-text-weight-using-tf-idf-in-python-plain-code-and-scikit-learn-50cb1e4375ad

[2] How are TF-IDF values calculated by the scikit-learn TfidfVectorizer?

The closest result I got was by following the steps in [1], which gave [includ = 0.57496]; the value I want is [includ = 0.630099].
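For completeness, this is the hand calculation I attempted: a minimal sketch of the formula from the scikit-learn documentation with its defaults, i.e. smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, tf taken as the raw term count, and L2 normalisation of each document vector. On the two documents above it also gives includ ≈ 0.574962:

```python
import math

docs = [
    ['includ', 'name', 'function', 'type', 'argument'],  # Document 1
    ['name', 'function', 'type', 'argument'],            # Document 2
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency and smoothed idf (scikit-learn default smooth_idf=True):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
df = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

# tf is the raw term count; each document vector is then L2-normalised
# (scikit-learn default norm='l2').
weights = []
for d in docs:
    raw = [d.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    weights.append({t: v / norm for t, v in zip(vocab, raw)})

print(round(weights[0]['includ'], 6))  # the value I get for 'includ' in Doc 1
```

Since this matches [1] but not my program's output, I suspect the corpus I actually passed to fit_transform contains more than these two documents.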

Thanks!!
