My goal is to compare the text txt with each item in corpus below using the TF-IDF weighting scheme.

corpus=['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']

txt='James the school boy is always busy reading'

Here's my implementation:

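TF-IDF = term frequency-inverse document frequency = tf * log(n / df), where tf = number of times the term occurs in the document, df = number of documents containing the term, and n = number of documents in the corpus (3 in this case).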
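
As a quick check of this formula before the full implementation, here is a minimal sketch that evaluates it for three terms of the 3-document corpus. The helper name `tfidf_weight` is hypothetical and only used for illustration; the natural logarithm is assumed, as in the implementation below.

```python
from math import log


def tfidf_weight(tf, df, n):
    """Return tf * log(n / df) using the natural logarithm."""
    return tf * log(float(n) / df)


# 'school' occurs once in its document and appears in 1 of the 3 documents
school = tfidf_weight(tf=1, df=1, n=3)
# 'boy' occurs once in its document and appears in 2 of the 3 documents
boy = tfidf_weight(tf=1, df=2, n=3)
# 'is' appears in all 3 documents, so log(3 / 3) = 0 and the weight vanishes
is_weight = tfidf_weight(tf=1, df=3, n=3)

print(school, boy, is_weight)
```

These values should line up with the weights for 'school', 'boy' and 'is' in the results further down.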

import collections
from collections import Counter
from math import log

txt2=Counter(txt.split())
corpus2=[Counter(x.split()) for x in corpus]
def tfidf(doc,_corpus):
    dic=collections.defaultdict(int)
    for x in _corpus:
       for y in x:
          dic[y] +=1
    for x in doc:
       if x not in dic:dic[x]=1.
    return {x : doc[x] * log(3.0/dic[x])for x in doc}

txt_tfidf=tfidf(txt2, corpus2)
corpus_tfidf=[tfidf(x, corpus2) for x in corpus2]

Results

print txt_tfidf
    {'boy': 0.4054651081081644, 'school': 1.0986122886681098, 'busy': 1.0986122886681098, 'James': 1.0986122886681098,
     'is': 0.0, 'always': 1.0986122886681098, 'the': 0.4054651081081644, 'reading': 0.0}
for x in corpus_tfidf:
    print x
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'school': 1.0986122886681098, 'is': 0.0}
{'a': 1.0986122886681098, 'is': 0.0, 'who': 1.0986122886681098, 'comic?': 1.0986122886681098, 'reading': 0.0}
{'boy': 0.4054651081081644, 'the': 0.4054651081081644, 'reading': 0.0, 'little': 1.0986122886681098, 'is': 0.0}

I'm not quite sure if I'm right, because rare terms such as James and comic should have higher TF-IDF weights than a common term like school.

Any suggestions will be appreciated.

hippietrail
user2274879
  • While 'school' is perhaps more common in English, 'school' has the same distribution as 'comic?' in your data set, so it appears that their scores should be similar – confuser Jun 20 '14 at 22:33
  • @confuser, but school appears twice while comic appears once, or is it just about the corpus? Please also let me know if my implementation is correct, and any suggestions on how to make the comparison. Thanks. – user2274879 Jun 20 '14 at 22:41
  • Ah, I see. I _believe_ all the sentences (including `txt`) need to be part of the corpus for TF-IDF to really make sense. (Maybe someone more knowledgeable on this stuff can confirm this?) – confuser Jun 20 '14 at 22:51
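
The point raised in these comments can be checked by counting document frequencies directly. The following is a rough sketch, assuming plain whitespace tokenization as in the question; the helper name `document_frequency` is hypothetical.

```python
from collections import Counter

corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading']
txt = 'James the school boy is always busy reading'


def document_frequency(docs):
    """Count in how many documents each whitespace-separated token occurs."""
    df = Counter()
    for doc in docs:
        # set() ensures a token is counted at most once per document
        df.update(set(doc.split()))
    return df


df_without_txt = document_frequency(corpus)
df_with_txt = document_frequency(corpus + [txt])

for term in ('school', 'comic?', 'James'):
    # A Counter returns 0 for terms it has never seen
    print(term, df_without_txt[term], df_with_txt[term])
```

With the original 3-document corpus, 'school' and 'comic?' occur in the same number of documents, so they get the same idf. Once txt is added to the corpus, 'school' occurs in more documents than 'comic?' or 'James'.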

1 Answer

First of all, as @confuser suggested in the comments, let's put txt in the corpus and get rid of this code:

for x in doc:
   if x not in dic:dic[x]=1.

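After that, I want to add a . to your code, because a dot in coding is like salt in cooking ;) It makes the counts floats, so the division inside log is not truncated to an integer in Python 2.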

    for y in x:
        dic[y] += 1.

Oh, I also see a magic number in your code (the hard-coded 3.0). Excuse me, but magic numbers make me nervous, so we use len(_corpus) instead:

return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}
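
To illustrate why the float count matters once the hard-coded 3.0 is gone, here is a small sketch. It assumes a 4-document corpus and a term that occurs in 3 of the documents. The `//` operator is used only to reproduce what Python 2 does for `int / int`; the variable names are made up for this example.

```python
from math import log

n_docs = 4       # len(_corpus) is always an int
df_int = 3       # document frequency counted with `+= 1`
df_float = 3.    # document frequency counted with `+= 1.`

# Python 2 evaluates int / int as floor division; `//` mimics that here
truncated = log(n_docs // df_int)

# With a float count the ratio keeps its fractional part
exact = log(n_docs / df_float)

print(truncated, exact)
```

With integer counts the ratio collapses to 1, so the weight becomes 0 even though the term is missing from one document.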

With all of these small modifications, the complete code looks like this:

import collections
from collections import Counter
from math import log

# txt is now the last document of the corpus
corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading',
          'James the school boy is always busy reading']

txt = corpus[-1]

# Term counts (tf) for txt and for every document in the corpus
txt2 = Counter(txt.split())
corpus2 = [Counter(x.split()) for x in corpus]


def tfidf(doc, _corpus):
    # dic[y] = number of documents that contain term y (df)
    dic = collections.defaultdict(int)
    for x in _corpus:
        for y in x:
            dic[y] += 1.  # float, so the division below is never truncated
    return {x: doc[x] * log(len(_corpus) / dic[x]) for x in doc}


txt_tfidf = tfidf(txt2, corpus2)
corpus_tfidf = [tfidf(x, corpus2) for x in corpus2]

# Parenthesized form works in both Python 2 and Python 3
print(txt_tfidf)

It seems normal to me for 'boy' to have a much lower TF-IDF than 'busy'. Do you agree?

{'boy': 0.28768207245178085, 'school': 0.6931471805599453, 'busy': 1.3862943611198906, 'James': 1.3862943611198906, 'is': 0.0, 'always': 1.3862943611198906, 'the': 0.28768207245178085, 'reading': 0.0}
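
The question also asks how to compare txt with each corpus item once the weights are available. One common option, which is not part of the answer above and is only an assumption about what "compare" should mean here, is cosine similarity between the TF-IDF dictionaries. A minimal sketch, with a hypothetical `cosine` helper:

```python
from collections import Counter, defaultdict
from math import log, sqrt

corpus = ['the school boy is reading', 'who is reading a comic?', 'the little boy is reading',
          'James the school boy is always busy reading']


def tfidf(doc, docs):
    """Same weighting as above: tf * log(n / df) over a list of Counters."""
    df = defaultdict(int)
    for d in docs:
        for term in d:
            df[term] += 1
    return {t: doc[t] * log(float(len(docs)) / df[t]) for t in doc}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


counts = [Counter(d.split()) for d in corpus]
vectors = [tfidf(c, counts) for c in counts]

txt_vec = vectors[-1]  # txt is the last document in the corpus
scores = [cosine(txt_vec, v) for v in vectors[:-1]]

for sentence, score in zip(corpus[:-1], scores):
    print(sentence, score)
```

Under these assumptions the first sentence should score highest, because it shares 'school' with txt, and the second sentence should score 0, because the only shared terms ('is', 'reading') have zero weight.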
Mehraban
  • Many thanks for your contribution; your answer and explanation are just perfect. I got exactly the same answer about an hour ago. – user2274879 Jun 21 '14 at 08:59