I want to auto-correct words that are in my list.

Say I have a list

kw = ['tiger','lion','elephant','black cat','dog']

I want to check whether these words appear in my sentence. If they are misspelled, I want to correct them. I don't intend to touch any words other than those in the given list.

Now I have a list of strings:

s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]

Expected output:

['tiger','lion',None,'dog']

My Efforts:

import difflib

op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)

My Output:

[[], [], [], ['dog']]

The problem with the above code is that it compares the entire sentence, while an entry in my `kw` list can contain more than one word (up to 4-5 words).

If I lower the cutoff value, it starts returning words it should not.

And even if I build bigrams and trigrams from the given sentence, that would consume a lot of time.

So is there a way to implement this?

I have explored a few more libraries like autocorrect, hunspell, etc., but with no success.

– Sociopath

3 Answers


You could implement something based on Levenshtein distance.

It's interesting to note Elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html

Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters

1 for strings of three, four, or five characters

2 for strings of more than five characters

I like to use pyxDamerauLevenshtein myself.

pip install pyxDamerauLevenshtein

So you could do a simple implementation like:

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        # Edit-distance budget, following Elasticsearch's AUTO fuzziness:
        # 0 for 1-2 characters, 1 for 3-5 characters, 2 for anything longer.
        n = len(word)
        if n < 3:
            budget = 0
        elif n < 6:
            budget = 1
        else:
            budget = 2
        if budget:
            # Replace the word with the first keyword within the budget.
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    new_sentence.append(keyword)
                    break
            else:
                # No keyword was close enough; keep the original word.
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return " ".join(new_sentence)

Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized and will be really slow with a lot of keywords. You should implement some kind of bucketing so you don't compare every word against every keyword.
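
One simple form of bucketing (a sketch of my own, not part of the answer above): a single edit changes a string's length by at most one, so a word can only be within the budget of keywords whose length differs from its own by at most that budget. Grouping keywords by length up front skips most comparisons:

from collections import defaultdict

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']

# Group the keywords by length once, up front.
buckets = defaultdict(list)
for keyword in keywords:
    buckets[len(keyword)].append(keyword)


def best_match(word, budget):
    # Only keywords whose length is within `budget` of the word's
    # length can possibly be within `budget` edits of it.
    for length in range(len(word) - budget, len(word) + budget + 1):
        for keyword in buckets.get(length, []):
            if damerau_levenshtein_distance(word, keyword) <= budget:
                return keyword
    return None


print(best_match("tyger", 1))  # tiger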

– PascalVKooten

Here is one way using difflib.SequenceMatcher. The SequenceMatcher class lets you measure the similarity of two strings with its ratio method; you only need to pick a suitable threshold and keep the keywords whose ratio against a given word is above it:

from difflib import SequenceMatcher


def find_similar_word(s, kw, thr=0.5):
    out = []
    for sentence in s:
        found = False
        for word in sentence.split():
            for keyword in kw:
                # ratio() is a similarity score in [0, 1].
                if SequenceMatcher(a=word, b=keyword).ratio() > thr:
                    out.append(keyword)
                    found = True
                    break
            if found:
                break
        else:
            # No word in this sentence matched any keyword.
            out.append(None)
    return out

Output

find_similar_word(s, kw)
['tiger', 'lion', None, 'dog'] 
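
For intuition on the threshold (an illustrative aside, not part of the original answer): ratio() is defined as 2*M/T, where M is the number of matching characters and T is the total length of both strings. That is why "bulldogs" still clears the 0.5 cutoff against 'dog':

from difflib import SequenceMatcher

# "dog" occurs inside "bulldogs", so M = 3 matching characters and
# T = 8 + 3 = 11 total characters: ratio = 2*3/11 ≈ 0.545 > 0.5.
print(SequenceMatcher(a="bulldogs", b="dog").ratio())  # 0.5454...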
– yatu
  • I am afraid this will be too slow. Actually, I am implementing it for the chatbot so speed matters for me. – Sociopath Apr 30 '19 at 17:18

Although this is slightly different from your expected output (it is a list of lists instead of a list of strings), I think it is a step in the right direction. The reason I chose this method is so that you can have multiple corrections per sentence. That is why I added another example sentence.

import difflib
import itertools

kw = ['tiger', 'lion', 'elephant', 'black cat', 'dog']
s = ["I saw a tyger", "There are 2 lyons", "I mispelled Kat", "bulldogs", "A tyger is different from a doog"]

# Match every word of every sentence against the keyword list,
# then flatten the per-word matches into one list per sentence.
op = [[difflib.get_close_matches(j, kw, cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]

print(op)

The output it generates is:

[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]

The trick is to split each sentence on whitespace.
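
To also handle multi-word keywords such as 'black cat' (the limitation raised in the comment below), one could match word n-grams up to the longest keyword length instead of single words. A minimal, unoptimized sketch along those lines (my own addition; overlapping n-grams can match the same keyword more than once):

import difflib
import itertools

kw = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def ngrams(words, n):
    # Join every run of n consecutive words into one candidate phrase.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def close_matches(sentence, keywords, cutoff=0.5):
    words = sentence.split()
    # The longest keyword (in words) bounds the n-gram size we need.
    max_n = max(len(k.split()) for k in keywords)
    candidates = itertools.chain.from_iterable(
        ngrams(words, n) for n in range(1, max_n + 1))
    matched = (difflib.get_close_matches(c, keywords, cutoff=cutoff)
               for c in candidates)
    return list(itertools.chain.from_iterable(matched))


print(close_matches("I saw a blak cat", kw))
# finds 'black cat' (possibly more than once, via overlapping n-grams)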

– thomas
  • won't work in my case as my `kw` list may contain more than one word, and if I split on whitespace it won't give the correct result. – Sociopath Apr 30 '19 at 17:25