There are several questions on SO and the web describing how to take the cosine similarity
between two strings, and even between two strings with TF-IDF weights. But the output of a function like scikit-learn's linear_kernel
confuses me a little.
Consider the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']
df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)
print(df.head())
                    a                 b                                   ab
0         hello world        my name is               hello world my name is
1          my name is       hello world               my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?
Question:
I'd like to add a column holding, for each row, the cosine similarity between the string in a and the string in b.
What I tried:
I fit a TF-IDF vectorizer on ab, so that its vocabulary includes all the words:
clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
clf.fit(df['ab'])
I then got the sparse TF-IDF matrices for both the a and b columns:
tfidf_a = clf.transform(df['a'])
tfidf_b = clf.transform(df['b'])
Now, if I use scikit-learn's linear_kernel, which is what others recommended, I get back a Gram matrix of shape (n_samples_X, n_samples_Y), as their docs describe it. Here that is 3 x 3, with one entry for every pair of rows rather than one per row:
from sklearn.metrics.pairwise import linear_kernel
linear_kernel(tfidf_a,tfidf_b)
array([[ 0., 1., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
But what I need is a simple vector, where the first element is cos_sim(a[0], b[0]), the second element is cos_sim(a[1], b[1]), and so forth.
Using Python 3 and scikit-learn 0.17.
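To make the target concrete, here is a minimal sketch of the row-wise result I am after. It is one possible approach, not necessarily the recommended one. It assumes the vectorizer keeps its default norm='l2', so each TF-IDF row has unit length (or is all zeros) and the per-row dot product equals the cosine similarity. The names vectorizer, row_sims and cos_sim are made up for the illustration, and stop_words='english' is left out here because the English stop list removes most of the words in these short strings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']
df = pd.DataFrame(data={'a': a, 'b': b})

# Fit on the concatenation so the vocabulary covers both columns.
# Assumption: default norm='l2', so non-empty rows are unit length.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
vectorizer.fit(df['a'] + ' ' + df['b'])
tfidf_a = vectorizer.transform(df['a'])
tfidf_b = vectorizer.transform(df['b'])

# Multiply the two sparse matrices element-wise and sum each row:
# this is the dot product of row i of a with row i of b only.
row_sims = np.asarray(tfidf_a.multiply(tfidf_b).sum(axis=1)).ravel()
df['cos_sim'] = row_sims

# For comparison: the same numbers sit on the diagonal of the full
# pairwise matrix, which costs n*n dot products instead of n.
pairwise_diag = np.diag(linear_kernel(tfidf_a, tfidf_b))

print(df[['a', 'b', 'cos_sim']])
```

The element-wise version avoids building the full n x n matrix, which matters once the DataFrame has more than a few thousand rows.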