
I'm looking for an efficient way of creating a similarity vector for a single sentence against a list of sentences.

The trivial way of doing that is to iterate over the list of sentences and compute the similarity between the single sentence and each one in the list. This solution is too slow, and I'm looking for a faster way.

My final goal is to detect whether the list of sentences already contains a sentence that is very similar to the one I'm checking; if so, I'll move on to the next sentence.

My solution right now is:

ignore_sent_flag = False
for single_sentence in list_of_sentences:
    # Compare the candidate sentence against each stored sentence in turn.
    similarity_score = word2vec.sentences_similarity(sentence2test, single_sentence)
    if similarity_score >= similarity_th:
        ignore_sent_flag = True
        break
if not ignore_sent_flag:
    # No sufficiently similar sentence was found, so keep the candidate.
    list_of_sentences.append(sentence2test)

I've tried putting 'list_of_sentences' in a dictionary/set, but the runtime improvement is minor.

I came across this solution, but it is based on a Linux-only package, so it is not relevant for me.

Lior Magen
  • Are you interested in one-to-all or all-to-all similarity checking? Also does the solution need to be gensim based? – Gökhan Sever Apr 22 '16 at 01:42
  • @GökhanSever I'm interested in one-to-all, while the 'all' list keeps growing. – Lior Magen Apr 24 '16 at 06:31
  • If your solution doesn't require gensim, you can simply compute the Jaccard similarity, based on either character n-grams or word-grams (a sketch follows after this thread). – Gökhan Sever Apr 24 '16 at 14:45
  • The solution requires Gensim actually. – Lior Magen Apr 24 '16 at 15:04
  • @LiorMagen any update on how you resolved this? – Anish Sep 12 '18 at 00:13
  • @Anish Yes. I built a matrix containing all my vectors and multiplied it by its own transpose. That gives you, in one operation, the dot product of every pair of vectors (the similarity between them); a sketch follows below. – Lior Magen Sep 12 '18 at 06:05
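
A minimal sketch of the matrix approach from the last comment, assuming each sentence has already been encoded as a fixed-size NumPy vector (for example, an average of its word2vec word vectors); 'sentence_vectors', 'candidate_vector', and 'similarity_th' are illustrative names, not part of the original code:

import numpy as np

# Stack the sentence vectors into a matrix and normalize each row,
# so that dot products equal cosine similarities.
M = np.vstack(sentence_vectors)                   # shape: (n_sentences, dim)
M = M / np.linalg.norm(M, axis=1, keepdims=True)

# One-to-all: a single matrix-vector product scores the candidate
# against every stored sentence at once.
v = candidate_vector / np.linalg.norm(candidate_vector)
scores = M @ v                                    # shape: (n_sentences,)
ignore_sent_flag = bool((scores >= similarity_th).any())

# All-to-all: multiplying the matrix by its transpose gives every
# pairwise similarity in one operation.
pairwise = M @ M.T                                # shape: (n_sentences, n_sentences)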
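
And a minimal sketch of the Jaccard suggestion from earlier in the thread, using word-grams; the tokenization (lowercased whitespace split) is an illustrative assumption:

def jaccard_similarity(a, b):
    # Jaccard similarity on word sets: |intersection| / |union|.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)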

2 Answers


I would like to suggest two things: 1. Try putting 'list_of_sentences' in a file. 2. Loop over the file with regular expressions; it's faster.

Icelander
  • 'list_of_sentences' has a dynamic size: if the similarity is below a given threshold, I add 'sentence2test' to 'list_of_sentences', so it sounds like a waste of time to save a file and load it so many times. I'm looking for a method that exploits the fact that these are NumPy objects. – Lior Magen Apr 21 '16 at 08:28

Hash your sentences using LSH (1) and only test the sentences in the hash bucket that your candidate matched. Instead of comparing all sentences, you will need to test only a much smaller subset.

(1) How to understand Locality Sensitive Hashing?
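
A minimal sketch of this idea, using random-hyperplane LSH for cosine similarity on sentence vectors; the dimensionality, number of hyperplanes, and threshold are illustrative assumptions:

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 100, 16                      # vector size / hash length (assumed)
planes = rng.standard_normal((n_planes, dim))

def lsh_key(vec):
    # The sign pattern of the vector against each hyperplane is the bucket
    # key; vectors pointing in similar directions tend to share a pattern.
    return tuple(bool(s) for s in (planes @ vec > 0))

buckets = defaultdict(list)                  # bucket key -> list of vectors

def is_near_duplicate(vec, threshold=0.9):
    # Compare the candidate only against vectors in its own bucket,
    # a much smaller subset than the whole list.
    key = lsh_key(vec)
    for other in buckets[key]:
        cos = vec @ other / (np.linalg.norm(vec) * np.linalg.norm(other))
        if cos >= threshold:
            return True
    buckets[key].append(vec)                 # no close match: index the candidate
    return False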

fnl