
I have a small 30-line text file with two similar words on each line. I need to calculate the Levenshtein distance between the two words on each line, and I need to use a memoize function while calculating the distance. I am pretty new to Python and to algorithms in general, so this is proving quite difficult for me. I have the file open and being read, but I cannot figure out how to assign the two words on each line to the variables 'a' and 'b' to calculate the distance.

Here is my current script, which right now ONLY prints the file:

txt_file = open('wordfile.txt', 'r')

def memoize(f):
    cache = {}
    def wrapper(*args, **kwargs):
        try:
            return cache[args]
        except KeyError:
            result = f(*args, **kwargs)
            cache[args] = result
            return result
    return wrapper

@memoize
def lev(a,b):
    if len(a) > len(b):
        a,b = b,a
        b,a = a,b

current = range(a+1)
for i in range(1,b+1):
    previous, current = current, [i]+[0]*n
    for j in range(1,a+1):
        add, delete = previous[j]+1, current[j-1]+1
        change = previous[j-1]
        if a[j-1] != b[i-1]:
            change = change + 1
        current[j] = min(add, delete, change)

return current[b]

if __name__=="__main__":
    with txt_file as f:
        for line in f:
            print line

Here are a few words from the text file so you all get an idea:

archtypes, archetypes

propietary, proprietary

recogize, recognize

exludes, excludes

tornadoe, tornado

happenned, happened

vacinity, vicinity

HERE IS AN UPDATED VERSION OF THE SCRIPT, STILL NOT FUNCTIONAL BUT BETTER:

class memoize:
    def __init__(self, function):
        self.function = function
        self.memoized = {}

    def __call__(self, *args):
        try:
            return self.memoized[args]
        except KeyError:
            self.memoized[args] = self.function(*args)
            return self.memoized[args]

@memoize
def lev(a,b):
    n, m = len(a), len(b)
    if n > m:
        a, b = b, a
        n, m = m, n
    current = range(n + 1)
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

if __name__=="__main__":
    for pair in open("wordfile.txt", "r"):
        a,b = pair.split()
        lev(a, b)
Ty Bailey
  • It's good practice to keep your definitions (memoize, lev, etc.) and your actual tasks (reading the file, looping) separate, i.e. keep all definitions before `if __name__=='__main__':` and all the main work of your script right under this `if` statement. As such, it would be nice to have the `open` call after the `__name__` check. I think `current = range(a+1)` is part of your `lev` implementation; try to indent it correctly. Could you also show a few lines from `wordfile.txt` for more clarity? – Abhishek Mishra Oct 09 '12 at 16:05
  • What constitutes a word in this scenario? I assume anything with letters only, but is that the assumption you are making? – grieve Oct 09 '12 at 16:06
  • Yes, anything with letters only. The words are very simple and very similar, with a few letters off in each word. I added a few words from the file into the question for clarity. – Ty Bailey Oct 09 '12 at 16:07
  • Your lev() function doesn't appear to return anything. Is this intentional? – grieve Oct 09 '12 at 16:10
  • Yes it is intentional for now because I am not sure how to implement it into the text file. – Ty Bailey Oct 09 '12 at 16:11
  • Hmm, your updated code does return values, just `print lev(a, b)` in the main loop and see :) – Abhishek Mishra Oct 09 '12 at 17:06
  • You are correct, this works. Thank you! Any idea how I can get it to print so it outputs like this: "word1a, word1b, lev(word1a, word1b), numcalls1" where numcalls1 = the number of times the function is called for each distance computation? – Ty Bailey Oct 09 '12 at 18:08
  • You might want your function to return multiple values for this. You could make it return a tuple with all the info you want. E.g. `return ("tom", 4, anyObj)` and then, the caller can unpack it as `foo, bar, beep = lev(x,y)` – Abhishek Mishra Oct 10 '12 at 02:23
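The two suggestions in the comments above (reporting the number of calls, and returning several values as a tuple) can be combined in a rough sketch. `CallCounter` and `lev_with_calls` are hypothetical names that do not appear in the thread, and the code is written for Python 3 rather than the Python 2 used in the question. It wraps `lev` in a small counting object and returns the words, the distance, and the call count together so the caller can unpack them.

```python
class CallCounter(object):
    """Hypothetical helper: wraps a function and counts how often it is called."""

    def __init__(self, function):
        self.function = function
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.function(*args)


def lev(a, b):
    """Iterative Levenshtein distance, same algorithm as in the question."""
    n, m = len(a), len(b)
    if n > m:
        # make sure a is the shorter word so the rows stay small
        a, b = b, a
        n, m = m, n
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]


def lev_with_calls(a, b):
    """Return (word_a, word_b, distance, number_of_calls) as one tuple."""
    counted = CallCounter(lev)  # fresh counter for every pair of words
    distance = counted(a, b)
    return (a, b, distance, counted.calls)


# the caller unpacks the tuple, as suggested in the comment
word1, word2, distance, numcalls = lev_with_calls("tornadoe", "tornado")
print("%s, %s, %d, %d" % (word1, word2, distance, numcalls))
```

Because this `lev` is iterative and never calls itself, the count is 1 for every pair; the number only becomes interesting for a recursive implementation.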

2 Answers

Assuming the issue is with passing the words to lev, and assuming your wordfile looks something like this:

bat, man
cat, goat
foo, bar

Then you could do something like this:

if __name__ == '__main__':
    for pair in open("wordfile", "r"):
        # first, strip the trailing newline and remove all spaces, then break around the comma
        a, b = pair.strip().replace(' ', '').split(',')
        # pass these words to lev
        lev(a, b)
Abhishek Mishra
  • Okay, this allowed me to assign the words to the a & b variables, but now I am getting an error like this: "cannot concatenate 'str' and 'int' objects", even though I am not using any integers? EDIT: The error is coming from the line `current = range(a+1)` – Ty Bailey Oct 09 '12 at 16:15
  • You are adding a+1, but a is a string "range(a+1)" – grieve Oct 09 '12 at 16:16
  • If you wanted to produce a range that is 1 more than the length of the variable `a` (which is a string), you'd do `range(len(a) + 1)` – Abhishek Mishra Oct 09 '12 at 16:20
  • Hmm, it still isn't working. Every time I change the range to what you said above I get errors all the way down the script... Here are the two main errors: "cannot concatenate 'str' and 'int' objects" & "list indices must be integers, not str" – Ty Bailey Oct 09 '12 at 16:27
  • Could you paste your latest script on http://gist.github.com or somewhere else and pass a link here? – Abhishek Mishra Oct 09 '12 at 16:28
  • Here is the gist link: https://gist.github.com/10318a711f5e7a790948 (I set it to private, so if that doesn't work, here is the git URL: git@gist.github.com:10318a711f5e7a790948.git). I removed the `range(len(a) + 1)` for now. – Ty Bailey Oct 09 '12 at 16:33
  • I also added a different version than the one in the gist to the question above. – Ty Bailey Oct 09 '12 at 16:41
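The errors discussed in the comments above come from using a string where an integer is expected. A minimal sketch of the failing expression and the `len()`-based fix, written for Python 3 (where the error message is worded differently from the Python 2 text quoted above); the variable names `failed` and `row` are only for illustration:

```python
a = "bat"

# a + 1 tries to add an int to a str, which raises TypeError
try:
    range(a + 1)
    failed = False
except TypeError:
    failed = True

# use the length of the string for sizes and loop bounds instead
row = list(range(len(a) + 1))

print(failed)  # True
print(row)     # [0, 1, 2, 3]
```

The same reasoning most likely explains "list indices must be integers, not str": in the first script, `current[b]` indexes a list with the word itself rather than with its length.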

I figured out the answer to this question with some help from Abhishek's answer and comments. Here is the final functioning script in case anyone else needs it:

def memoize(f):
    cache = {}
    def wrapper(*args, **kwargs):
        try:
            return cache[args]
        except KeyError:
            result = f(*args, **kwargs)
            cache[args] = result
            return result
    return wrapper

@memoize
def lev(a,b):
    n, m = len(a), len(b)
    if n > m:
        a, b = b, a
        n, m = m, n
    current = range(n + 1)
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]

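if __name__=="__main__":
    # NOTE: Counter is not defined anywhere in this post (it is not collections.Counter);
    # it was presumably a call-counting wrapper, so the line is left commented out.
    # lev = Counter(lev)
    with open('wordfile.txt', 'r') as word_file:
        for line in word_file:
            if not line.strip():
                continue  # skip blank lines
            # split on the comma and strip whitespace so neither word keeps ", " or "\n"
            a, b = [w.strip() for w in line.split(',')]
            # this print form works in both Python 2 and Python 3
            print("%s %s %d" % (a, b, lev(a, b)))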
Ty Bailey
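As a side note, the iterative `lev` above never calls itself, so the `memoize` cache only helps when the same pair of words is passed in more than once. The following is a hedged sketch of a recursive variant, in which the cache is consulted during a single distance computation. `lev_recursive` is a hypothetical name, the code targets Python 3, and this is an alternative illustration rather than the script from the answer.

```python
def memoize(f):
    cache = {}

    def wrapper(*args):
        # compute each distinct argument tuple only once
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]

    return wrapper


@memoize
def lev_recursive(a, b):
    """Recursive Levenshtein distance; subproblems are cached by memoize."""
    if not a:
        return len(b)  # insert every remaining character of b
    if not b:
        return len(a)  # delete every remaining character of a
    cost = 0 if a[-1] == b[-1] else 1
    return min(
        lev_recursive(a[:-1], b) + 1,          # deletion
        lev_recursive(a, b[:-1]) + 1,          # insertion
        lev_recursive(a[:-1], b[:-1]) + cost,  # match or substitution
    )


# sample lines in the same "word, word" format as the question's file
for pair in ["archtypes, archetypes", "tornadoe, tornado"]:
    a, b = [w.strip() for w in pair.split(",")]
    print("%s %s %d" % (a, b, lev_recursive(a, b)))
```

Without the decorator the same prefix pairs would be recomputed many times; with it, each `(a, b)` prefix pair is evaluated once. For short words like those in the file the recursion depth stays well within Python's default limit.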