My goal is to compare the text txt with each item in corpus below using the TFIDF weighting scheme.
corpus=['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']
txt='James the school boy is always busy reading'
Here's my implementation:
TFIDF = term frequency-inverse document frequency = tf * log(n/df), where n is the number of documents in the corpus (3 in this case) and df is the number of documents that contain the term.
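As a quick check of the formula: with n = 3, the factor log(n/df) can only take three values, one per possible document frequency. The sketch below prints them; the helper name idf is an assumption for illustration and is not part of the implementation.

```python
from math import log

def idf(df, n=3):
    # Inverse document frequency for a term found in df of the n documents.
    # float() avoids integer division on Python 2.
    return log(float(n) / df)

for df in (1, 2, 3):
    print("df=%d  idf=%.4f" % (df, idf(df)))
```

Since every term here occurs at most once per document (tf = 1), these are the only non-trivial weights this corpus can produce.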
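from collections import Counter, defaultdict
from math import log
# Term frequencies for the query text and for each corpus document.
txt2 = Counter(txt.split())
corpus2 = [Counter(x.split()) for x in corpus]
def tfidf(doc, _corpus):
    # Document frequency: number of corpus documents that contain each term.
    dic = defaultdict(int)
    for x in _corpus:
        for y in x:
            dic[y] += 1
    # Terms that never occur in the corpus are treated as if df = 1.
    for x in doc:
        if x not in dic:
            dic[x] = 1
    # n = number of documents (3 here); float() avoids integer division.
    n = float(len(_corpus))
    return {x: doc[x] * log(n / dic[x]) for x in doc}
txt_tfidf = tfidf(txt2, corpus2)
corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]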
Results
print(txt_tfidf)
{'boy': 0.4054651081081644, 'school': 1.0986122886681098, 'busy': 1.0986122886681098, 'James': 1.0986122886681098,
'is': 0.0, 'always': 1.0986122886681098, 'the': 0.4054651081081644, 'reading': 0.0}
for x in corpus_tfidf:
    print(x)
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'school': 1.0986122886681098, 'is': 0.0}
{'a': 1.0986122886681098, 'is': 0.0, 'who': 1.0986122886681098, 'comic?': 1.0986122886681098, 'reading': 0.0}
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'little': 1.0986122886681098, 'is': 0.0}
I'm not sure this is right, because rare terms such as James and comic should have higher TFIDF weights than a common term like school.
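One way to check this is to list the document frequency of each term in txt. The sketch below counts a term once per document; the names doc_freq and term are illustrative and not part of the implementation above.

```python
from collections import Counter

corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']
txt = 'James the school boy is always busy reading'

# Count each term once per document, so doc_freq[t] is the number of
# corpus documents that contain t (0 for terms absent from the corpus).
doc_freq = Counter(term for doc in corpus for term in set(doc.split()))

for term in txt.split():
    print("%-8s df=%d" % (term, doc_freq[term]))
```

In this corpus school occurs in only one document, so its df is 1. James, always and busy have df 0 and are set to 1 by the fallback in tfidf, so all four terms end up with the same weight, log(3/1).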
Any suggestions will be appreciated.
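For the comparison step itself, here is a minimal sketch assuming cosine similarity between the weight dictionaries. Cosine similarity is an assumption on my part about how the comparison might be done; it is not in the implementation above, and the function name cosine is hypothetical. The two example vectors reuse the weights reported for txt and for the first corpus item.

```python
from math import log, sqrt

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as {term: weight}.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

w1 = log(3.0 / 1)  # weight of a term with df = 1
w2 = log(3.0 / 2)  # weight of a term with df = 2

# Weights as reported in the results for txt and for corpus[0].
txt_vec = {'James': w1, 'the': w2, 'school': w1, 'boy': w2,
           'is': 0.0, 'always': w1, 'busy': w1, 'reading': 0.0}
doc_vec = {'the': w2, 'school': w1, 'boy': w2, 'is': 0.0, 'reading': 0.0}

sim = cosine(txt_vec, doc_vec)
print("similarity to corpus[0]: %.4f" % sim)
```

Terms with weight 0 (is, reading) contribute nothing to the score, so only the, school and boy drive the similarity here.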