
Suppose I have two sets,

a = {"this is a title", ...}
b = {"this is a short description of some title from a", ...}

What is the best way to find, for an element of a, the best match in set b (or vice versa)? The approach I tried was to build a tf-idf bag-of-words vector space from the tokens of b, and then compute cosine similarities. For a given element of a, the pair (a, b) was selected if b had a higher cosine similarity than any other element of b. But it is not very accurate.
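
For reference, the quantities involved are the standard ones (sklearn's default idf is a smoothed variant of the textbook formula shown here):

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}, \qquad \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

where $N$ is the number of abstracts, $\mathrm{df}_t$ is the number of abstracts containing term $t$, and the selected match is the $b$ that maximizes the cosine with $a$'s vector.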

Are there any better methods to do this? How can I improve the accuracy?

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# titles and abstracts are arrays of strings
tfidf = TfidfVectorizer(stop_words='english', analyzer='word')

# learn the vocabulary and idf weights from the abstracts only
vec = tfidf.fit_transform(abstracts)

def predict(title):
    # project the title into the abstracts' tf-idf space
    titlevec = tfidf.transform([title])
    # cosine similarity of the title against every abstract
    sim = cosine_similarity(titlevec, vec)
    # index of the best-matching abstract
    return np.argmax(sim)

for i, title in enumerate(titles):
    index = predict(title)
    print "Title: {0}\nAbstract: {1}".format(title, abstracts[index])
yayu
  • How did you calculate the cosine similarity, and what does *"not very accurate"* mean? How large is your document collection (i.e., was it a realistically sized collection or just a toy example)? Could you also post some code showing how you calculated the whole thing? – tttthomasssss Oct 05 '14 at 14:39
  • @tttthomasssss this is not a real-world problem. I just wanted to see how I can improve my accuracy, get tips to enhance it, etc. I don't want anyone to solve it for me, so I didn't post code earlier. I have tried a few things like adding stemmers and clustering (see the stemming sketch after these comments), but the posted code has the highest accuracy. – yayu Oct 05 '14 at 21:09
  • Then I can recommend the [IR Book](http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html) if you haven't had a look at it yet; [chapter 6](http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html) in particular contains a good overview. – tttthomasssss Oct 06 '14 at 08:34
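
A minimal sketch of the "adding stemmers" idea mentioned in the comments, assuming NLTK's PorterStemmer is installed; the helper name stemmed_analyzer is hypothetical. It reuses the vectorizer's default word analyzer (lowercasing, stop-word removal) and stems each resulting token, so a title and an abstract can match on a shared stem even when the surface forms differ:

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
# build the default word analyzer once, keeping lowercasing and stop-word removal
base_analyzer = TfidfVectorizer(stop_words='english', analyzer='word').build_analyzer()

def stemmed_analyzer(doc):
    # tokenize with the default analyzer, then stem every token
    return [stemmer.stem(token) for token in base_analyzer(doc)]

# drop-in replacement for the vectorizer in the question; the rest of the
# code (fit_transform, predict) is unchanged
tfidf = TfidfVectorizer(analyzer=stemmed_analyzer)

Note that with a callable analyzer, sklearn skips its own tokenization and stop-word handling, so those live entirely in base_analyzer.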

0 Answers