
There are several questions on SO and the web describing how to take the cosine similarity between two strings, and even between two strings with TFIDF as weights. But the output of a function like scikit's linear_kernel confuses me a little.

Consider the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)
print(df.head())

                    a                 b                                   ab
0         hello world        my name is               hello world my name is
1          my name is       hello world               my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?

Question: I'd like to have a column that is the cosine similarity between the strings in a and the strings in b.

What I tried:

I fit a TfidfVectorizer on ab, so as to include all the words from both columns:

clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
clf.fit(df['ab'])

I then got the sparse TFIDF matrix of both a and b columns:

tfidf_a = clf.transform(df['a'])
tfidf_b = clf.transform(df['b'])

Now, if I use scikit's linear_kernel, which is what others recommended, I get back a Gram matrix of shape (n_samples_a, n_samples_b), as mentioned in their docs.

from sklearn.metrics.pairwise import linear_kernel
linear_kernel(tfidf_a,tfidf_b)

array([[ 0.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

But what I need is a simple vector, where the first element is the cosine similarity between the first row of a and the first row of b, the second element is cos_sim(a[1], b[1]), and so forth.
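To make the desired output concrete: since entry (i, j) of that Gram matrix is the dot product of row i of a and row j of b (and TfidfVectorizer's default norm='l2' makes that dot product a cosine similarity), the vector I'm after is just its diagonal. But computing the full n×n matrix only to discard the off-diagonal entries seems wasteful for larger frames:

import numpy as np

# the diagonal holds cos_sim(a[i], b[i]) for each row i
np.diag(linear_kernel(tfidf_a, tfidf_b))

array([ 0.,  0.,  0.])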

Using python3, scikit-learn 0.17.


2 Answers


I think your example is falling down a bit because your TfidfVectorizer is filtering out the majority of your words: with stop_words='english', almost every word in your example is a stop word. I've removed that parameter and made your matrices dense so we can see what's happening. What if you did something like this?

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)

clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit(df['ab'])

tfidf_a = clf.transform(df['a']).toarray()   # dense 2-D array; each row is a plain 1-D vector
tfidf_b = clf.transform(df['b']).toarray()

# scipy's cosine is a *distance*, so 1 - distance gives the similarity
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x], tfidf_b[x]) for x in range(len(tfidf_a))]
row_similarities

[0.0, 0.0, 0.72252389079716417]

This shows the similarity between each pair of rows. I'm not fully on board with how you're building the full corpus, but the example isn't optimized at all, so I'll leave that for now. Hope this helps.
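As a side note, you could also vectorize this instead of looping, and skip densifying entirely: with TfidfVectorizer's default norm='l2', each row is unit length, so a row-wise dot product of the two sparse matrices is the cosine similarity. A sketch of the same computation:

import numpy as np

sparse_a = clf.transform(df['a'])   # keep the matrices sparse this time
sparse_b = clf.transform(df['b'])

# element-wise product summed along each row = row-wise dot products
row_similarities = np.asarray(sparse_a.multiply(sparse_b).sum(axis=1)).ravel()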

  • thanks, this worked. Why aren't you on board with how I'm building the full corpus? – David Apr 25 '16 at 00:13
  • Because there's usually a better way to do it than using .apply for this type of task. Are there 6 documents here (3 rows in each of the two columns), two separate documents (a and b), or 3 documents (one per row)? It matters for calculating the frequencies in the TFIDF, and I'm not sure that the way you're constructing ab reflects what you mean to do (see the sketch below for one alternative). – flyingmeatball Apr 25 '16 at 01:41
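To make that comment concrete, one alternative reading — treating each of the six cells as its own document — would fit the vectorizer like this (a sketch, not necessarily what the question intends):

clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit(pd.concat([df['a'], df['b']]))   # six documents: three from each column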
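A from-scratch take on the same idea: build each document's tf-idf vector as a plain dict, L2-normalize it, and compute the cosine similarity as a dot product over the terms the two vectors share (since the vectors are unit length, no explicit cosine call is needed). Lightly commented below; mytokenizer, stemmer, sortedstopwords, getidf, getqvec, gettfidfvec and the speechvecs store are assumed to be defined elsewhere: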
from math import log10, sqrt
import operator

dfs = {}                  # document frequency of each token
idfs = {}                 # inverse document frequencies
speeches = {}             # raw documents, keyed by filename
speechvecs = {}           # normalized tf-idf vectors, keyed by filename
total_word_counts = {}    # corpus-wide count of each token

def tokenize(doc):
    # lowercase, drop stop words, and stem
    tokens = mytokenizer.tokenize(doc)
    lowertokens = [token.lower() for token in tokens]
    filteredtokens = [stemmer.stem(token) for token in lowertokens if not token in sortedstopwords]
    return filteredtokens

def incdfs(tfvec):
    # update document frequencies and corpus-wide counts from one document's tf vector
    for token in set(tfvec):
        if token not in dfs:
            dfs[token] = 1
            total_word_counts[token] = tfvec[token]
        else:
            dfs[token] += 1
            total_word_counts[token] += tfvec[token]


def calctfidfvec(tfvec, withidf):
    # build a log-scaled (1 + log10 tf) tf-idf vector, then L2-normalize it
    tfidfvec = {}
    veclen = 0.0

    for token in tfvec:
        if withidf:
            tfidf = (1 + log10(tfvec[token])) * getidf(token)
        else:
            tfidf = 1 + log10(tfvec[token])
        tfidfvec[token] = tfidf
        veclen += pow(tfidf, 2)

    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)

    return tfidfvec

def cosinesim(vec1, vec2):
    # both vectors are unit length, so the cosine similarity is just the
    # dot product over the terms they have in common
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token] * vec2[token]

    return sim

def query(qstring):
    # score every stored document against the query and return the best match
    qvec = getqvec(qstring.lower())
    scores = {filename: cosinesim(qvec, tfidfvec) for filename, tfidfvec in speechvecs.items()}
    return max(scores.items(), key=operator.itemgetter(1))[0]

def docdocsim(filename1, filename2):
    # cosine similarity between two stored documents
    return cosinesim(gettfidfvec(filename1), gettfidfvec(filename2))
  • While this code snippet may solve the problem, it doesn't explain why or how it answers the question. Please [include an explanation for your code](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers), as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Scott Weldon Oct 20 '16 at 03:12
  • I find this code self-documenting and I don't even know python. – Seth Oct 03 '19 at 22:43
  • I feel like there should be a cosine function somewhere within the cosine similarity, yet there is none. Why? – Adam Bajger Apr 03 '21 at 10:01