I have the two lists below, and I want to find words that are similar, meaning a Levenshtein distance of less than 2. I have a function that computes the Levenshtein distance, but it takes the two words as parameters. I can find which words are missing from the other list, but that does not help. I can also go index by index, but in the case below everything is thrown off once I reach index 7 (but and except): infidelity is at index 9 in the first list and 8 in the second, and wocp88 is at index 10 while wcop88 is at 9, so those pairs are never compared. Is there some way to say "if part of infidelity appears in some word in the other list, then check those two"? Note that this won't always work: for infidelity and infedellty, only the in and ty can match, and many other words could match those as well.

[u'rt', u'cuaimatizada', u's', u'cuaimaqueserespeta', u'forgives', u'any', u'mistake', u'but', u'the', u'infidelity', u'wocp88']
[u'rt', u'cuiamatizada', u's', u'cuimaqueserespeta', u'forgive', u'any', u'mistake', u'except', u'infedelity', u'wcop88']

Edit: My goal is to feed my Levenshtein function the two words that need to be checked. In this case, the following pairs:

u'cuaimatizada'  u'cuiamatizada'

u'cuaimaqueserespeta'  u'cuimaqueserespeta'

u'forgives'  u'forgive'

u'infedelity'  u'infidelity'

u'wocp88'  u'wcop88'

I do not know beforehand which words these will be.

JPvdMerwe
jacobLoz
  • Can you clarify the question a bit? What is your goal? – Emil Vikström Jul 11 '12 at 16:28
  • I'm not sure what you want either... are you looking for `zip(list1, list2)`? – Joran Beasley Jul 11 '12 at 16:31
  • How do you determine which words not to compare? In other words, if you don't know the words beforehand, what criteria did you use to determine that `(u'the', u'infedelity')` is wrong? – Joel Cornett Jul 11 '12 at 16:34
  • That is the issue, Joel: I want to compare words that are relatively similar, because that should mean it's a typo. – jacobLoz Jul 11 '12 at 16:36
  • @jacobLoz: If you don't need to use Levenshtein distance, you could try looking at [difflib.get_close_matches](http://docs.python.org/library/difflib.html#difflib.get_close_matches). – JPvdMerwe Jul 11 '12 at 17:03
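
For reference, the `difflib.get_close_matches` route suggested in the last comment could look roughly like the sketch below. It ranks candidates by `difflib`'s similarity ratio rather than by Levenshtein distance, so it will not give exactly the same pairs. The `cutoff=0.8` value and the `close` dictionary are assumptions made for illustration, not something from the question.

```python
import difflib

list1 = [u'rt', u'cuaimatizada', u's', u'cuaimaqueserespeta', u'forgives',
         u'any', u'mistake', u'but', u'the', u'infidelity', u'wocp88']
list2 = [u'rt', u'cuiamatizada', u's', u'cuimaqueserespeta', u'forgive',
         u'any', u'mistake', u'except', u'infedelity', u'wcop88']

# For every word in list1, collect the words in list2 whose similarity
# ratio is at least 0.8 (an assumed threshold; tune it for your data).
close = {}
for word in list1:
    close[word] = difflib.get_close_matches(word, list2, n=3, cutoff=0.8)

# Print only the words that found a non-identical close match.
for word, candidates in close.items():
    typos = [c for c in candidates if c != word]
    if typos:
        print(word, typos)
```

Because the ratio is not an edit distance, transposed letters such as `wocp88` / `wcop88` can still be paired here even though their Levenshtein distance is 2.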

1 Answer

I think this is what you want, but note that it compares every word in one list against every word in the other, not just the words at matching indexes:

 wordpairs = [(w1,w2) for w1 in list1 for w2 in list2 if levenshtein(w1,w2) < 2]

>>> matches = [(w1,w2) for w1 in l12 for w2 in l22 if levenshtein(w1,w2) < 2]

[(u'rt', u'rt'), (u's', u's'), (u'cuaimaqueserespeta', u'cuimaqueserespeta'), (u'forgives', u'forgive'), (u'any', u'any'), (u'mistake', u'mistake'), (u'infidelity',u'infedelity')]
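
A self-contained version of the same idea, as a rough sketch. The asker's Levenshtein function is not shown in the question, so the `levenshtein` below is a stand-in: a plain dynamic-programming edit distance that counts insertions, deletions and substitutions.

```python
def levenshtein(a, b):
    """Stand-in edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


list1 = [u'rt', u'cuaimatizada', u's', u'cuaimaqueserespeta', u'forgives',
         u'any', u'mistake', u'but', u'the', u'infidelity', u'wocp88']
list2 = [u'rt', u'cuiamatizada', u's', u'cuimaqueserespeta', u'forgive',
         u'any', u'mistake', u'except', u'infedelity', u'wcop88']

# Compare every word of list1 with every word of list2.
matches = [(w1, w2) for w1 in list1 for w2 in list2 if levenshtein(w1, w2) < 2]
print(matches)
```

Note that swapped letters count as two edits with this distance, which is why `cuaimatizada` / `cuiamatizada` and `wocp88` / `wcop88` do not show up with a threshold of `< 2`. Catching those would need either a threshold of `<= 2` or a distance that treats a transposition as a single edit.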
Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
  • or... `filter(lambda i: levenshtein(*i) < 2, itertools.product(list1, list2))` – Joel Cornett Jul 11 '12 at 16:48
  • That's probably faster, so +1... although I think the list comprehension is moderately more readable – Joran Beasley Jul 11 '12 at 16:53
  • I'd just mention that you can speed this algorithm up quite a bit by defining `is_levenstein_less_than_2(x,y)`. You'd want to do this because you can implement it in `O(min(|x|, |y|))` by only doing the DP along the main diagonal. – JPvdMerwe Jul 11 '12 at 17:04
  • You could speed this up (I think) by only computing levenshtein(w1,w2) after you know that `abs(len(w1)-len(w2)) < 2`. The edit distance is always at least the difference in length between w1 and w2, so if the lengths differ by 2 or more the distance cannot be less than 2. +1 tho! – dawg Jul 11 '12 at 18:44
  • I think that's what the comment by JPvdMerwe is implying... not sure, though – Joran Beasley Jul 11 '12 at 20:32
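
The two speed-ups from the comments (a length check first, and a linear-time "distance less than 2" test) can be combined in one small function. The sketch below is not the banded DP JPvdMerwe describes; it is a simpler prefix-skipping check that should give the same answer for this particular threshold. The name `within_one_edit` is made up for illustration.

```python
def within_one_edit(a, b):
    """Return True if the Levenshtein distance between a and b is 0 or 1."""
    # Cheap filter: the distance is at least the difference in length.
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Same length: at most one substitution at position i.
        return a[i + 1:] == b[i + 1:]
    # b is one longer: b[i] must be the single extra character.
    return a[i:] == b[i + 1:]


list1 = [u'rt', u'cuaimatizada', u's', u'cuaimaqueserespeta', u'forgives',
         u'any', u'mistake', u'but', u'the', u'infidelity', u'wocp88']
list2 = [u'rt', u'cuiamatizada', u's', u'cuimaqueserespeta', u'forgive',
         u'any', u'mistake', u'except', u'infedelity', u'wcop88']

matches = [(w1, w2) for w1 in list1 for w2 in list2 if within_one_edit(w1, w2)]
print(matches)
```

This still compares every pair of words, so the overall work is quadratic in the list lengths; only the cost per pair drops.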