
I have a list of strings and I want to filter out the strings that are too similar, based on Levenshtein distance. So if lev(list[0], list[10]) < 50, then del list[10]. Is there any way I can calculate such a distance between every pair of strings in the list more efficiently? Thanks!

data2= []
for i in data:
    for index, j in enumerate(data):
        s = levenshtein(i, j)
        if s < 50:
            del data[index]
    data2.append(i)

The rather dumb code above is taking too long to compute...

Blue482
  • Need more information in order to answer. The Levenshtein algorithm is said to be slow. Also, data and data1 are not defined. Have you looked at http://www.levenshtein.net? Have you used Python's profiler? – xxyzzy Apr 04 '15 at 14:51
  • Levenshtein is symmetric. You might want to construct your nested for-loops accordingly. See http://stackoverflow.com/questions/9722022/levenshtein-distance-symmetric – xxyzzy Apr 04 '15 at 14:56
  • Thanks. I am using the 5th Levenshtein implementation from [link](http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python). data is a list of 6000 sentences, and I want to keep only one of each pair of very similar sentences. – Blue482 Apr 04 '15 at 15:23
  • 1
    have you considered using the memorised version (http://rosettacode.org/wiki/Levenshtein_distance#Python), if the word are particularly similar this might save you a lot of time. secondly if you are deleting as you go will the indexes in the enumerate not get out of sync with the index of data, you could simply store a list to get arround this. – user2539336 Apr 04 '15 at 15:36
  • Thanks. The memorised version actually takes longer to compute in my case... – Blue482 Apr 04 '15 at 16:10
  • 1
    You can try https://github.com/ztane/python-Levenshtein/ C-based version of Levenstein algo is much faster than in Python. If you want further speed-up, you can add some logic like compare some first 10 characters of strings, if they are similar enough, search for similarity in the full string. Then you can parallelize using multiprocessing module. With these basic ideas, I get immense speed-ups on the order of 100X, compared to very raw, sequential, full-string comparisons. – Gökhan Sever Apr 16 '15 at 21:19
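A minimal sketch of the ideas in that last comment (the C-backed python-Levenshtein package plus a cheap prefix pre-filter). The function name too_similar and the prefix constants are illustrative, not from this thread, and the snippet assumes pip install python-Levenshtein:

import Levenshtein  # C implementation of the distance, much faster than pure Python

THRESHOLD = 50      # full-string cut-off from the question
PREFIX_LEN = 10     # how many leading characters the cheap check looks at
PREFIX_CUTOFF = 5   # arbitrary cut-off for the prefix check; tune for your data

def too_similar(a, b):
    # Heuristic pre-filter: if even the first few characters differ a lot,
    # skip the expensive comparison of the full sentences.
    if Levenshtein.distance(a[:PREFIX_LEN], b[:PREFIX_LEN]) > PREFIX_CUTOFF:
        return False
    return Levenshtein.distance(a, b) < THRESHOLD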

1 Answer


What if we kept only the indexes of the hit strings and just skipped them later? I don't know how much enumerate() and del cost, or what the percentage of hits is (i.e. how many strings must be removed from your dataset).

THRESHOLD = 50
data = ["hel", "how", "are", "you"] # replace with your dataset

tbr = {} # holds the index of the strings to be removed
for idx, i in enumerate(data):
    for j in xrange(len(data)):
        if j != idx and levenshtein(i, data[j]) < THRESHOLD:
            tbr[j] = True

# print tbr
data2 = []
for idx, d in enumerate(data):
    if idx in tbr:
        continue # skip this string
    data2.append(d)
# print data2
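
Since the distance is symmetric (as noted in the comments under the question), a variant of the same index-skipping idea can visit each unordered pair only once and keep the earlier string of every similar pair, which matches the goal of keeping one sentence per similar pair. A rough sketch, assuming the same levenshtein() function and THRESHOLD as above:

from itertools import combinations

tbr = {} # holds the index of the strings to be removed
for i, j in combinations(xrange(len(data)), 2):
    if i in tbr or j in tbr:
        continue # one of the pair is already scheduled for removal
    if levenshtein(data[i], data[j]) < THRESHOLD:
        tbr[j] = True # j > i, so the earlier string of the pair is kept

data2 = [d for idx, d in enumerate(data) if idx not in tbr]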
Pynchia
  • Thanks! But the 'holds the index of the strings to be removed' step is still taking like forever... – Blue482 Apr 04 '15 at 19:48