How to make text search and similarity computation across millions of records efficient in Python

Question

I have two tables with 2 million records each. One holds the item names, the other holds item descriptions along with other attributes. I have to match each item in table 1 against each description in table 2 to find the best similarity match. So for each of the 2 million items I have to scan the whole other table, which makes 2 million * 2 million computations. How can I do that efficiently in Python? As it stands, it would take years to compute.
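To put a rough number on that, here is a back-of-the-envelope estimate. The per-comparison costs (1 microsecond and 1 millisecond) are assumptions chosen for illustration, not measurements of the actual code.

```python
# Rough cost of comparing every item name with every description.
n_items = 2_000_000
n_descriptions = 2_000_000
total_pairs = n_items * n_descriptions  # 4 * 10**12 comparisons

SECONDS_PER_DAY = 86_400


def days_needed(seconds_per_pair):
    """Days of single-core time at the given (assumed) cost per pair."""
    return total_pairs * seconds_per_pair / SECONDS_PER_DAY


# Assumed costs per comparison: 1 microsecond and 1 millisecond.
for cost in (1e-6, 1e-3):
    print(f"{cost:g} s per pair -> {days_needed(cost):,.0f} days")
```

Under these assumptions, even a microsecond per pair means about a month and a half on one core, and a millisecond per pair means well over a century.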

Right now my approach is a regex search: I split each item name into a list of words and check whether each word is contained in the description. If it is, I increase the match count by 1, and I use that count to calculate the similarity.
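A minimal sketch of the approach just described, assuming whole-word, case-insensitive matching and a similarity defined as the fraction of item-name words found in the description. The function names and sample strings are hypothetical, and the original code may compute the final score differently.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_score(item_name, description):
    """Fraction of item-name words that also occur in the description."""
    name_words = tokenize(item_name)
    if not name_words:
        return 0.0
    desc_words = tokenize(description)
    # Each shared word counts as one match (assumed scoring rule).
    return len(name_words & desc_words) / len(name_words)


# Hypothetical sample data, for illustration only.
print(match_score("red cotton shirt",
                  "A soft shirt made of red cotton, machine washable"))
print(match_score("red cotton shirt",
                  "Blue denim jeans with a red label"))
```

Because the score is normalised by the number of words in the item name, a long description is not penalised for containing extra words.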

So my questions are:

How can I make my computations faster? Should I use multithreading, split the data, or something similar?
Is there any other similarity algorithm that would work here? Please note that I have descriptions on the other side, so cosine similarity etc. doesn't work because of the differing number of words.

I am fetching the data from mongodb and putting it into python dataframe. — stkusr1234, Jul 01 '16 at 10:50

score 0 · Answer 1 · answered Jul 01 '16 at 10:55

You could try the Distance package to calculate the Levenshtein distance as a similarity measure.

From the documentation:

Comparing lists of strings can also be useful for computing similarities between sentences, paragraphs, etc., in articles or books, as for plagiarism recognition:

>>> import distance
>>> sent1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> sent2 = ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog']
>>> distance.levenshtein(sent1, sent2)
3

Or the python-Levenshtein package:

>>> from Levenshtein import distance
>>> distance('Levenshtein', 'Lenvinsten')
4
>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
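For readers who want to see what these packages compute, here is a pure-Python sketch of the standard dynamic-programming Levenshtein distance. It is an illustration only, not the code of either package, and it will be much slower than their compiled implementations. It accepts strings or lists of tokens.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Character level and word level, using the inputs from the examples above.
print(levenshtein("Levenshtein", "Levenshten"))
print(levenshtein(
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog'],
))
```

Each call takes time proportional to the product of the two lengths, so on its own this does nothing to reduce the number of pairs that have to be compared.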

I tried this. But as I said there is description on the other end, it is not giving me good similarity. — stkusr1234, Jul 01 '16 at 11:05

score 0 · Answer 2 · answered Jul 01 '16 at 11:55

You can use NLTK as well.

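from nltk.metrics import accuracy, edit_distance

# Position-wise agreement between two equal-length tag sequences
reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
test = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(accuracy(reference, test))

# Character-level Levenshtein distance between two strings
print(edit_distance("rain", "shine"))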
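For context, here is a pure-Python sketch of what a position-wise accuracy score computes, which is, as far as I understand it, what NLTK's accuracy does. The name positional_accuracy is hypothetical and not part of NLTK.

```python
def positional_accuracy(reference, test):
    """Fraction of positions at which two equal-length lists agree."""
    if len(reference) != len(test):
        # A position-by-position comparison is undefined otherwise.
        raise ValueError("Lists must have the same length.")
    return sum(r == t for r, t in zip(reference, test)) / len(test)


reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
test = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(positional_accuracy(reference, test))
```

The score compares items position by position, so it only applies to sequences of equal length, such as two taggings of the same sentence.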

arshpreet

The accuracy function requires reference and test to be of the same length. My test strings are not the same length as the references. – stkusr1234 Jul 01 '16 at 12:41
