Suppose I have two sets,
a = {"this is a title", ...}
b = {"this is a short description of some title from a", ...}
What is the best way to find the best match in set b for an element of set a, or vice versa? The approach I tried was to build a tf-idf bag-of-words vector space from the tokens of b and then compute cosine similarity. For a given element of a, I select the element of b whose cosine similarity with it is highest. But it is not very accurate.
Are there any better methods to do this? How can I improve the accuracy?
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# titles and abstracts are lists of strings
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
vec = tfidf.fit_transform(abstracts)  # vocabulary is fitted on the abstracts only

def predict(title):
    # Return the index of the abstract most similar to the given title
    titlevec = tfidf.transform([title])
    sim = cosine_similarity(titlevec, vec)
    return np.argmax(sim)

for title in titles:
    index = predict(title)
    print("Title: {0}\nAbstract: {1}".format(title, abstracts[index]))
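
To make this easier to reproduce, here is a minimal self-contained sketch of the same approach on toy data. The sample strings and the names `abstract_vecs`, `title_vecs`, and `best` are made up for illustration and are not my real data. It computes all title-to-abstract similarities in one call instead of looping:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, invented purely for illustration
titles = [
    "deep learning for image recognition",
    "graph algorithms for social networks",
]
abstracts = [
    "we study shortest path and community detection algorithms on large social networks",
    "convolutional networks achieve strong results in image recognition tasks using deep learning",
]

# Same setup as above: vocabulary and idf weights come from the abstracts only
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')
abstract_vecs = tfidf.fit_transform(abstracts)
title_vecs = tfidf.transform(titles)

# sim[i, j] is the cosine similarity between titles[i] and abstracts[j]
sim = cosine_similarity(title_vecs, abstract_vecs)

# For each title, the index of the highest-scoring abstract
best = np.argmax(sim, axis=1)

for title, j in zip(titles, best):
    print("Title: {0}\nAbstract: {1}\n".format(title, abstracts[j]))
```

Looking at the full `sim` matrix rather than only the argmax also shows how close the runner-up scores are, which is where I see most of the wrong matches.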