
I'm trying to write a spellchecker module.

It loads a text, creates a dictionary from a 16 MB file, and then checks whether each encountered word is similar to a word in the dictionary (similar = differs by up to two characters); if so, it replaces the word with the form from the dictionary.

Right now I'm using a Levenshtein distance algorithm, and processing a set of 50 words takes 3 minutes...

I'm pretty sure that there must be a faster solution. The profiler told me that my app spends more than 80% of its time in the Levenshtein distance function.

Are there any better solutions/algorithms?

Here is the implementation of the algorithm I use:

def decide_of_equality(c1, c2):
    # Helper not shown in the original post; reconstructed from the recurrence:
    # substitution cost, 0 on a match, 1 otherwise.
    return 0 if c1 == c2 else 1

def genM(x, n):
    # Helper not shown in the original post; reconstructed from the comments:
    # row 0 counts 0..n-1, every row x > 0 starts with x and is padded with zeros.
    yield x
    for i in xrange(1, n):
        yield i if x == 0 else 0

def levenshteinDistance(s1, s2):
    l_s1 = len(s1)
    l_s2 = len(s2)
    d = [[a for a in genM(x, l_s2 + 1)] for x in xrange(l_s1 + 1)]
    for i in xrange(1, l_s1 + 1):
        for j in xrange(1, l_s2 + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + decide_of_equality(s1[i - 1], s2[j - 1]))
    return d[l_s1][l_s2]
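As a hedged aside (not part of the original question): the recurrence above only ever reads the previous row of the table, so a two-row variant avoids building and storing the full matrix and is usually noticeably faster in pure Python:

```python
def levenshtein_two_rows(s1, s2):
    # Same dynamic-programming recurrence, but keeping only two rows.
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate over the longer string for a shorter row
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]
```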
Kijewski
Michal
  • Sounds more like "Autocorrect" than spellcheck, since spellcheckers usually create options and let users select from among them. Autocorrect is pretty obviously impossible to do well, a fact now almost universally acknowledged, even on TV commercials. :-) – Warren P Apr 08 '12 at 19:33
  • If you make the assumption that the first letter of the word is always correct, then you can just check the dictionary for words that start with that letter. It will decrease your time by more or less a factor of 26. – Doboy Apr 08 '12 at 19:33
  • I don't know much about Python, but your distance function uses the standard dynamic-programming solution. Here is my version in C++: http://codereview.stackexchange.com/questions/10130/edit-distance-between-two-strings maybe you can spot some difference. – Andrew Tomazos Apr 08 '12 at 19:35
  • Nevertheless, Google does it like this: http://norvig.com/spell-correct.html – Warren P Apr 08 '12 at 19:35
  • @WarrenP I was just about to post that link. It is a beautiful solution. I've used several of the techniques in it for many different kinds of word transformations. – Nolen Royalty Apr 08 '12 at 19:36
  • Thanks for your answers, guys. I'm looking at those solutions right now. – Michal Apr 08 '12 at 19:49
  • @JoelCornett it's a generator that helps with creating "masks" like d = [[1,2,3,4],[1,0,0,0],[2,0,0,0],[3,0,0,0],[4,0,0,0]] – Michal Apr 08 '12 at 21:48
  • @Wysek: Oh I see. I used a lambda function to evaluate the variable in the `a` position instead. I'm curious to know which one is more efficient... – Joel Cornett Apr 08 '12 at 21:51
  • @JoelCornett: I can say that, from my experience, generators are much faster than lambdas in Python. Even Guido encourages using generators instead of list comprehensions and map/lambda combinations. But, as usual, sometimes a lambda just fits; in this case I would say the generator fits better. If you do some kind of benchmark, please let me know, because I'm curious too. – Michal Apr 09 '12 at 15:55
  • @Wysek: Will do. I just thought of something else. Have you considered using an `array` instead of a list for `d`. It _may_ be more efficient because `d` doesn't get resized at any time during the operation. – Joel Cornett Apr 09 '12 at 18:40
  • @JoelCornett Sorry for the late comment; an array is indeed a much better choice for the job. – Michal May 04 '12 at 19:51
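The generator-versus-lambda timing the commenters are curious about is easy to measure with `timeit`. This is only an illustrative micro-benchmark; the two row-building expressions below are stand-ins for the `genM`-style generator and a map/lambda equivalent, not the actual code from the question:

```python
import timeit

# Build one 1000-element "mask" row (counting in row 0, zeros elsewhere)
# two ways: via a generator expression, and via map + lambda.
setup = "n = 1000; x = 0"
gen_expr = "list(i if x == 0 else 0 for i in range(n))"
map_lambda = "list(map(lambda i: i if x == 0 else 0, range(n)))"

gen_time = timeit.timeit(gen_expr, setup=setup, number=200)
map_time = timeit.timeit(map_lambda, setup=setup, number=200)
print("generator: %.4fs  map/lambda: %.4fs" % (gen_time, map_time))
```

The absolute numbers depend on the machine; the point is that either variant can be compared in a couple of lines before optimizing.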

2 Answers


I have used Norvig's spell corrector, mentioned in the comments, and it is awesome.

However, coming to your problem: you have written a dynamic-programming edit-distance algorithm, and it qualifies as a data-parallel algorithm. On shared memory, i.e. on a single machine with multiple cores, you can exploit them. Do you know about map-reduce? Please don't think about distributed systems right now; just consider a single quad-core machine with shared memory. As step 1, partition your dictionary and allocate a portion to each thread, which will run the edit-distance computation on its portion of the dictionary (similar to a map step). Afterwards, all your threads return the words at an edit distance of at most 2 (similar to a reduce step). This way your program will benefit from the multi-core architecture.
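A minimal sketch of that map/reduce-style partitioning with Python's `multiprocessing` module (the function and variable names here are illustrative, not from the question):

```python
from multiprocessing import Pool

def edit_distance(a, b):
    # Plain dynamic-programming edit distance (same recurrence as in the question),
    # kept to two rows for memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_matches(args):
    # "Map" step: scan one slice of the dictionary for words within distance 2.
    word, dictionary_slice = args
    return [w for w in dictionary_slice if edit_distance(word, w) <= 2]

def parallel_lookup(word, dictionary, workers=4):
    # Partition the dictionary, farm the slices out to worker processes,
    # then flatten the per-slice results ("reduce" step).
    chunk = (len(dictionary) + workers - 1) // workers
    slices = [dictionary[i:i + chunk] for i in range(0, len(dictionary), chunk)]
    pool = Pool(workers)
    try:
        results = pool.map(close_matches, [(word, s) for s in slices])
    finally:
        pool.close()
        pool.join()
    return [w for part in results for w in part]
```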

Another thing I can think of: since the edit-distance computation is CPU-intensive, write it in C and call it from your Python code, i.e. write a Python extension.

Yavar

Maybe the problem is at a higher level. When a profiler tells you that a lot of time is spent in a function, it might be that you're calling it too often. Are you perhaps comparing each word in the text to each word in the dictionary? Try it the other way around: for each word in the text, directly generate the words at distance <= 2 and check whether they're in the dictionary.
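A hedged sketch of that inversion, in the spirit of Norvig's article linked in the comments (the candidate generation includes transpositions, as in that article; the names are illustrative, and `dictionary` is assumed to be a set of correct words):

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    # All strings one edit away: deletions, transpositions, substitutions, insertions.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_corrections(word, dictionary):
    # Generate candidates at distance <= 2 and intersect with the dictionary,
    # instead of running the full distance computation against every entry.
    if word in dictionary:
        return {word}
    candidates = edits1(word)
    hits = candidates & dictionary
    if hits:
        return hits
    return {e2 for e1 in candidates for e2 in edits1(e1)} & dictionary
```

The candidate set grows with word length and alphabet size, but membership tests against a hashed dictionary are O(1), so this usually beats scanning a 16 MB word list per input word.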

Karl Knechtel
  • You are right that sometimes the problem lies in too many calls, but that's not my case. I can only use words from the dictionary, which is why I don't need to generate new words; instead I look for dictionary words that are within distance <= 2 of the encountered word. But you pointed out some good stuff for other cases. – Michal Apr 09 '12 at 16:01