How are the TF-IDF values calculated in scikit-learn (Python), and how can I reproduce the result below?
Document 1 : ['includ', 'name', 'function', 'type', 'argument']
Document 2 : ['name', 'function', 'type', 'argument']
# I ran the following code to calculate TF-IDF for the terms in both Doc 1 and Doc 2:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=processData, stop_words='english')
tfs = tfidf.fit_transform(rawContentDict.values())
tfs_Values = tfs.toarray()
tfs_Term = tfidf.get_feature_names()
I get the following TF-IDF values as output:
Document 1 : [includ = 0.630099, name = 0.448320, function = 0.448320, type = 0.448320, argument = 0.448320]
Document 2 : [includ = 0, name = 0.577350, function = 0.577350, type = 0.577350, argument = 0.577350]
Now I don't understand how these scores are computed. I tried to calculate them by hand but got different results from the program output. How is the TF-IDF score calculated in scikit-learn, and how can I reproduce the result above? Your help is much appreciated.
What I have tried:
I read the helpful articles [1] and [2] and implemented the steps they describe, but I still don't get the same results.
[2] How are TF-IDF calculated by the scikit-learn TfidfVectorizer
The closest result I got was by following the steps in [1], which gave [includ = 0.57496], but the value I want is [includ = 0.630099].
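For reference, this is a minimal sketch of the by-hand calculation I attempted, assuming scikit-learn's documented defaults (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each document vector); it reproduces the 0.57496 value from [1] on the two documents shown above, not the 0.630099 the program prints:

```python
import math

# The two tokenised documents from above
docs = [
    ['includ', 'name', 'function', 'type', 'argument'],  # Document 1
    ['name', 'function', 'type', 'argument'],            # Document 2
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# df(t): number of documents that contain term t
df = {t: sum(t in d for d in docs) for t in vocab}

# scikit-learn's default idf (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    # tf is the raw count; multiply by idf, then L2-normalise the vector
    raw = {t: doc.count(t) * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec1 = tfidf_vector(docs[0])
print(round(vec1['includ'], 5))  # prints 0.57496, not 0.630099
```

So either the program output above was produced from a different corpus than the two documents shown, or a non-default setting changes the formula; that is exactly what I am trying to pin down.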
Thanks!!