
Let's say we have a list of 50 sentences and an input sentence. How can I choose the sentence from the list that is closest to the input sentence?

I have tried many methods/algorithms, such as averaging the word2vec vector representations of each token in a sentence and then computing the cosine similarity of the resulting vectors.
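For reference, what I tried looks roughly like this (a minimal sketch; the 3-dimensional vectors below are made-up stand-ins for a real word2vec model):

```python
import math

# Made-up 3-dimensional "word vectors" standing in for real word2vec output.
WORD_VECS = {
    "what": [0.1, 0.0, 0.2], "is": [0.0, 0.1, 0.1], "the": [0.1, 0.1, 0.0],
    "definition": [0.9, 0.2, 0.1], "of": [0.0, 0.2, 0.1],
    "book": [0.2, 0.9, 0.1], "please": [0.1, 0.1, 0.3], "define": [0.8, 0.3, 0.2],
}

def sentence_vector(sentence):
    # Average the vectors of all in-vocabulary tokens.
    tokens = [t for t in sentence.lower().replace("?", "").split() if t in WORD_VECS]
    avg = [0.0, 0.0, 0.0]
    for t in tokens:
        for i in range(3):
            avg[i] += WORD_VECS[t][i]
    return [x / len(tokens) for x in avg]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

score = cosine(sentence_vector("what is the definition of book?"),
               sentence_vector("please define book"))
print(score)
```

The problem is exactly that every token contributes equally to the average, which motivates the weighting requirement below.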

For example I want the algorithm to give a high similarity score between "what is the definition of book?" and "please define book".

I am looking for a method (probably a combination of methods) which:

  1. accounts for semantics,
  2. accounts for syntax,
  3. gives different weights to tokens with different roles (e.g. in the first example, 'what' and 'is' should get lower weights).

I know this might be a bit general, but any suggestion is appreciated.

Thanks,

Amir

jasemi
    This is way too broad for Stack Overflow. – juanpa.arrivillaga Dec 21 '16 at 23:54
  • A difficult problem. I would suggest parsing the 50 sentences and keeping their parse trees. Then parse the incoming sentence. For each of the 50, as much as possible, compare words that have the same "part of speech" in both parse trees. Score the degree of match of the root of the word. But this problem is really wide open and one should expect to do a lot of experimentation to get best results. – mikeTronix Dec 22 '16 at 00:12

2 Answers


Before computing a distance between sentences, you need to clean them.

For that:

  1. Lemmatize your words to get the root of each one, so your sentence "what is the definition of book" would become "what be the definition of book".

  2. Delete all prepositions, forms of the verb "to be", and other stop words that carry little meaning, so "what be the definition of book" would become "definition book".

  3. Then transform your sentences into numeric vectors using the tf-idf method or word2vec.

  4. Finally, compute the cosine similarity between the vectors; the closer the cosine is to 1, the more similar your two sentences are.
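A minimal sketch of these four steps in plain Python (the lemma table and stop-word list below are tiny hand-made placeholders; a real pipeline would use an NLP library such as NLTK or spaCy):

```python
import math
from collections import Counter

# Tiny hand-made lemma table and stop-word list -- placeholders for
# what an NLP library would provide.
LEMMAS = {"is": "be", "are": "be", "was": "be", "defined": "define"}
STOPWORDS = {"what", "be", "the", "of", "to", "a", "an", "please"}

def clean(sentence):
    # Steps 1-2: lemmatize each token, then drop stop words.
    tokens = sentence.lower().replace("?", "").split()
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in lemmas if t not in STOPWORDS]

def tf_idf_vectors(docs):
    # Step 3: turn token lists into tf-idf weighted bag-of-words vectors.
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [
        {t: tf * (math.log(n / df[t]) + 1) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    # Step 4: cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm

docs = [clean("what is the definition of book?"), clean("please define book")]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
```

Note that the tf-idf weighting in step 3 also gives you the token weighting you asked for in your third requirement: frequent, low-information tokens get low weights.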

Hope that helps.


Your sentences are too sparse to compare the two documents directly. Aggressive morphological transformations (such as stemming or lemmatization) might help some, but will probably fall short given your examples.

What you could do is compare the 'search results' of the two sentences in a large document collection using a number of methods. According to the distributional hypothesis, similar sentences should occur in similar contexts (see the distributional hypothesis, but also Rocchio's algorithm, co-occurrence, and word2vec). Those contexts (when gathered in a smart way) could be large enough to allow some comparison (such as cosine similarity).
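To make that concrete, here is a toy sketch of the idea: represent each sentence by how strongly it matches each document in a collection (tiny here, large in practice) and compare those context vectors. The documents and the crude token-overlap retrieval score are invented for illustration:

```python
import math

# A tiny stand-in document collection; in practice this would be a large corpus.
DOCS = [
    "a book is a set of written pages",
    "to define a word is to state its meaning",
    "the definition of a term explains what it means",
    "cats and dogs are common pets",
]

def overlap(sentence, doc):
    # Crude retrieval score: number of shared tokens.
    return len(set(sentence.lower().replace("?", "").split()) & set(doc.split()))

def context_vector(sentence):
    # One entry per document: how strongly the sentence "retrieves" it.
    return [overlap(sentence, d) for d in DOCS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim = cosine(context_vector("what is the definition of book?"),
             context_vector("please define book"))
print(sim)
```

Even though the two sentences share only one surface token, they retrieve overlapping documents, so their context vectors come out similar.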

S van Balen