
Here is my code so far:

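import math

# Each document is a dict mapping a feature (term) to its weight or count.

def dot(docA,docB):
    # Sparse dot product: only features present in docA can contribute
    the_sum=0
    for (key,value) in docA.items():
        the_sum+=value*docB.get(key,0)
    return the_sum

def cos_sim(docA,docB):
    # Cosine similarity: dot product divided by the product of the vector norms
    sim=dot(docA,docB)/(math.sqrt(dot(docA,docA)*dot(docB,docB)))
    return sim

def doc_freq(doclist):
    # Number of documents in which each feature appears
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
    return df

def idf(doclist):
    # Inverse document frequency: log(N/df) for each feature
    N=len(doclist)
    return {feat:math.log(N/v) for feat,v in doc_freq(doclist).items()}


# Note: doc_freq returns document frequencies (how many documents contain
# each feature), not per-document term frequencies, despite the tf_ names.
tf_med=doc_freq(bow_collections["medline"])
tf_wsj=doc_freq(bow_collections["wsj"])

idf_med=idf(bow_collections["medline"])
idf_wsj=idf(bow_collections["wsj"])

print(tf_med)
print(idf_med)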

So I've finally managed to get this far, but I can't find information on what to do next in Python. The maths is documented, but I'd rather not spend hours working through what it means. For reference, this is what I get from tf_med:

{'NUM': 37, 'early': 3, 'case': 3, 'organ': 1, 'transplantation': 1, 'section': 1, 
'healthy': 1, 'ovary': 1, 'fertile': 1, 'woman': 1, 'unintentionally': 1, 
'unknowingly': 1, 'subjected': 1, 'oophorectomy': 1, 'described': 4, .... , }

And here is what I get from idf_med:

{'NUM': 0.3011050927839216, 'early': 2.8134107167600364, 'case': 2.8134107167600364, 
'organ': 3.912023005428146, 'transplantation': 3.912023005428146, 'section': 
3.912023005428146, 'healthy': 3.912023005428146, 'ovary': 3.912023005428146, 'fertile': 
3.912023005428146, .... , }

Now I don't know how to combine these two to get my TF-IDF values and, from there, my average cosine similarities. I understand they need to be multiplied, but how do I go about doing that?

bemzoo
  • You know how to do the math, but not how to code. We know how to code, but don't know the math. One of us has to know both, or you need to provide to us what you plan to do and what you are seeking – offeltoffel Dec 06 '18 at 13:55
  • So the dictionaries are the same size, and the inverse document frequency (from `idf_med`) of each key must be multiplied by the value for the same key in the other dictionary. So you have `'NUM': 37` * `'NUM':0.3011050927839216` – bemzoo Dec 06 '18 at 13:59
  • I believe I've achieved it with: `tfidf_med={k: tf_med[k]*idf_med[k] for k in tf_med}` – bemzoo Dec 06 '18 at 14:11
  • Well, iterating over keys is possible. Whatever tf-idf is, I'm glad you came up with a solution yourself... – offeltoffel Dec 06 '18 at 14:23
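
One caveat on the comprehension in the comments: `doc_freq` counts how many documents contain each feature, so multiplying its output by the IDF gives one collection-wide number per feature rather than a TF-IDF vector per document. A minimal sketch of the per-document version, assuming each collection is a list of dicts mapping a feature to its count in that document (which is what `dot` and `doc_freq` appear to expect). The helper names `tfidf` and `average_cos_sim` and the sample documents are hypothetical, not from the question:

```python
import math
from itertools import combinations

def idf(doclist):
    """log(N / df) per feature, where df is the number of documents containing it."""
    n_docs = len(doclist)
    df = {}
    for doc in doclist:
        for feat in doc:
            df[feat] = df.get(feat, 0) + 1
    return {feat: math.log(n_docs / count) for feat, count in df.items()}

def tfidf(doc, idf_weights):
    """Weight one document: its own term counts times the collection's IDF."""
    return {feat: count * idf_weights[feat] for feat, count in doc.items()}

def dot(doc_a, doc_b):
    return sum(value * doc_b.get(key, 0) for key, value in doc_a.items())

def cos_sim(doc_a, doc_b):
    norm = math.sqrt(dot(doc_a, doc_a) * dot(doc_b, doc_b))
    # A feature present in every document gets IDF 0, so a vector can be all zeros.
    return dot(doc_a, doc_b) / norm if norm else 0.0

def average_cos_sim(doclist):
    """Mean cosine similarity over all distinct pairs of TF-IDF-weighted documents."""
    if len(doclist) < 2:
        raise ValueError("need at least two documents to form a pair")
    weights = idf(doclist)
    vectors = [tfidf(doc, weights) for doc in doclist]
    pairs = list(combinations(vectors, 2))
    return sum(cos_sim(a, b) for a, b in pairs) / len(pairs)

# Hypothetical stand-in for bow_collections["medline"]
docs = [
    {"NUM": 2, "early": 1, "case": 1},
    {"NUM": 1, "organ": 1, "transplantation": 1},
    {"case": 2, "healthy": 1},
]
weighted = tfidf(docs[0], idf(docs))
avg = average_cos_sim(docs)
print(weighted)
print(avg)
```

With the real data, `average_cos_sim(bow_collections["medline"])` would be the call, if the collections have the assumed shape. Averaging over every pair is quadratic in the number of documents, which is fine for a collection of a few dozen documents but slow for much larger ones.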

1 Answer


You can use scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
text1 ='eat big yellow bananas'
text2 ='eat big yellow potatos'
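# Learn the vocabulary and IDF weights, then build one TF-IDF row per text
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([text1,text2])
# Compare the first text against every text, including itself
similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(similarity)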
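
To get from pairwise similarities to the average the question asks about, one option is to compute the full similarity matrix and average the entries above the diagonal, so that each pair is counted once and self-similarities are left out. A minimal sketch, assuming the collection is available as raw text strings. The sample `texts` are hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample texts; substitute the raw documents of one collection.
texts = [
    "eat big yellow bananas",
    "eat big yellow potatos",
    "stock prices fell sharply",
]

tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# With a single argument, every row is compared against every other row,
# giving a dense (n_texts, n_texts) array.
sim = cosine_similarity(tfidf_matrix)

# Indices strictly above the diagonal: each unordered pair exactly once.
upper = np.triu_indices(len(texts), k=1)
average_similarity = sim[upper].mean()
print(average_similarity)
```

Note that `TfidfVectorizer` by default smooths the IDF and L2-normalises each row, so its weights will not match a plain `log(N/df)` computation exactly.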