
I have two tables containing 2 million records each. One holds item names; the other holds item descriptions along with other attributes. I have to compare each item in table 1 with each description in table 2 to find the most similar match. So for each of the 2 million items, I have to scan the whole other table for the best match. That makes 2 million * 2 million comparisons! How can I do that efficiently in Python? As it stands, it would take years to compute.

My current approach is a regex search: I split each item name into a list of words and check whether each word is contained in the description. If it is, I increase the match count by 1, and I use that count to calculate the similarity.
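To make the approach concrete, here is a minimal sketch of that scoring. The function name `word_match_similarity`, the sample strings, the lowercasing, the whole-word matching, and dividing the match count by the number of words in the item name are all assumptions made for illustration; the actual code may differ.

```python
import re


def word_match_similarity(item_name, description):
    """Fraction of the words in item_name that occur as whole words in description."""
    words = item_name.lower().split()
    if not words:
        return 0.0
    text = description.lower()
    # Count one match per item-name word found in the description.
    matches = sum(
        1 for word in words
        if re.search(r"\b" + re.escape(word) + r"\b", text)
    )
    # Assumed normalization: matched words divided by total words in the name.
    return matches / len(words)


score = word_match_similarity("red cotton shirt", "A soft shirt made of red linen")
print(score)
```

Running this once per (item, description) pair is what leads to the 2 million * 2 million comparisons.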

So my questions are:

  1. How can I make the computation faster? Should I use multithreading, split the data, or something similar?

  2. Is there another similarity algorithm that would work here? Note that one side is a description, so cosine similarity and the like don't work because the number of words differs.

stkusr1234

2 Answers


You could try the Distance package and use the Levenshtein distance as a similarity measure.

From the documentation:

Comparing lists of strings can also be useful for computing similarities between sentences, paragraphs, etc., in articles or books, as for plagiarism recognition:

>>> import distance
>>> sent1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> sent2 = ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog']
>>> distance.levenshtein(sent1, sent2)
3

Or the python-Levenshtein package:

>>> from Levenshtein import distance
>>> distance('Levenshtein', 'Lenvinsten')
4

>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
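
Neither package is needed to see how this applies to names and descriptions of different lengths. The following is a rough, dependency-free sketch rather than either library's implementation: `levenshtein`, `similarity`, and `best_match` are hypothetical helper names, and scaling the distance by the longer sequence's length is one assumed way to get a score between 0 and 1.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(name, descriptions):
    """Return the description most similar to name, compared word by word."""
    name_words = name.lower().split()
    return max(descriptions, key=lambda d: similarity(name_words, d.lower().split()))


candidates = ["blue denim jeans", "red cotton shirt with buttons", "green wool hat"]
print(best_match("red cotton shirt", candidates))
```

Note that `best_match` still scans every description, so on its own it does not reduce the number of comparisons; it only swaps the scoring function.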
salomonderossi

You can use NLTK as well.

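from nltk.metrics import accuracy, edit_distance

# accuracy() compares two equal-length sequences position by position
reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
test = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(accuracy(reference, test))

# edit_distance() is the Levenshtein distance; the inputs may differ in length
print(edit_distance("rain", "shine"))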
arshpreet
The accuracy function requires reference and test to be the same length. My test strings are not the same length. – stkusr1234 Jul 01 '16 at 12:41