
I have a large list of 27,000 strings, and I have to find which pairs of strings are similar. For this I used the Python Levenshtein library to compute the similarity between two strings. With the code below this is straightforward:

from Levenshtein import jaro

count = 0
for i, coll1 in enumerate(college_list):
    for j, coll2 in enumerate(college_list):
        # Skip pairs whose lengths differ too much to be similar
        if abs(len(coll1) - len(coll2)) <= 15:
            similarity = jaro(coll1, coll2)
            if similarity >= 0.90 and similarity != 1.0:
                print "Probable Similar Strings"
                print coll1 + " AND " + coll2
                print "Similarity is %s" % (similarity)
                print "=" * 20
                count += 1

But as you can see, there are two for loops that compare every string with every other string, and the total number of such combinations is 729,000,000 (27,000 × 27,000).

My current code takes a long time to complete, and I have to vary the similarity threshold to get results that fit my use case. Running multiple iterations of this code with different similarity thresholds is definitely going to take a lot of time.

Does there exist a nicer and faster way to achieve the above functionality using numpy/pandas?

Anurag Sharma
    Why do you need to recalculate when you change the thresholds? Store e.g. a dictionary mapping the word pair to similarity `{(coll1, coll2): jaro(coll1, coll2), ...}` then you only have to calculate them once. Alternatively, bin the similarities (e.g. `0.05, 0.1, ...`) and store a mapping of each value to a list of word pairs with that value. – jonrsharpe May 26 '15 at 12:33
  • Looks like a duplicate of [this](http://stackoverflow.com/a/29433204/4077912) – Primer May 26 '15 at 13:57
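
To illustrate the caching idea from the first comment, here is a minimal sketch (the 0.80 floor and the sample thresholds are assumptions, chosen to keep the cache far smaller than the full set of pairs): compute each similarity exactly once, keep only the candidate pairs above a low floor, then filter at any threshold without recomputing.

from itertools import combinations
from Levenshtein import jaro

# Compute each pair's similarity exactly once and cache the candidates.
cache = {}
for coll1, coll2 in combinations(college_list, 2):
    similarity = jaro(coll1, coll2)
    if 0.80 <= similarity < 1.0:  # assumed floor, below any threshold of interest
        cache[(coll1, coll2)] = similarity

# Different thresholds can now be tried without recomputing anything.
for threshold in (0.85, 0.90, 0.95):
    matches = [pair for pair, sim in cache.items() if sim >= threshold]
    print "Threshold %s: %d candidate pairs" % (threshold, len(matches))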

3 Answers


Before thinking about switching to numpy, I think you should calculate the similarity only for j < i: it will halve the computation needed, because the Jaro similarity is symmetric.

See the example below: none of the cells marked "/" need to be calculated, because jaro("aa","aa") == 1 and jaro("ab","aa") == jaro("aa","ab").

i/j aa ab ac
aa   /  1  1
ab   /  /  1
ac   /  /  /
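
A minimal sketch of that idea, reusing the question's college_list and threshold; the inner index simply starts at i + 1 so each unordered pair is visited exactly once:

from Levenshtein import jaro

for i, coll1 in enumerate(college_list):
    # Start at i + 1: jaro is symmetric and the diagonal is always 1.0
    for j in xrange(i + 1, len(college_list)):
        coll2 = college_list[j]
        similarity = jaro(coll1, coll2)
        if similarity >= 0.90 and similarity != 1.0:
            print coll1 + " AND " + coll2, similarity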
Pierre.Sassoulas

You are looking for itertools, which solves the looping for you using generators and is much more efficient.

itertools.combinations also makes sure not to generate the same pair in reverse, which cuts the 729,000,000 ordered comparisons down to 27,000 × 26,999 / 2 ≈ 364.5 million unordered pairs.

combinations('ABCD', 2)
AB AC AD BC BD CD

Note that there is no BA or DA, because AB and AD already exist.

As you can see, I have dropped the string length comparison entirely from my example, simply because I do not expect many pairs of names that differ so much in length. I used a random name generator to produce some examples, and such a pair never occurred.

Even if it did occur a few times, the if would cost time on so many other rows that it may not be worth it. Not to mention that it could produce unwanted behaviour with extremely long strings.

I made a small example here for you:

import itertools
import Levenshtein


college_list = ['Dave', 'Jack', 'Josh', 'Donald', 'Carry', 'Kerry', 'Cole', 'Coal', 'Coala']
for pair in itertools.combinations(college_list, 2):
    similarity = Levenshtein.jaro(pair[0], pair[1])
    if similarity >= 0.90 and similarity != 1.0:
        print pair, similarity

Returns

('Coal', 'Coala') 0.933333333333
firelynx

To add to the proposed improvements, I suggest first sorting college_list by length, and then computing the Jaro similarity only for words whose difference in length is <= 15. Because the list is sorted, once one string is more than 15 characters longer than the current one, every later string will be too, so the inner loop can break early. Something like:

from Levenshtein import jaro

# Sorting by length means that once a string is more than 15
# characters longer than coll1, every later string is too.
college_list.sort(key=len)
for i, coll1 in enumerate(college_list):
    for j in xrange(i + 1, len(college_list)):
        coll2 = college_list[j]
        if len(coll2) - len(coll1) > 15:
            break  # all remaining strings are even longer
        similarity = jaro(coll1, coll2)
        if similarity >= 0.90 and similarity != 1.0:
            print "Probable Similar Strings"
matiasg