I want to auto-correct words that are in my list.

Say I have a list

kw = ['tiger','lion','elephant','black cat','dog']

I want to check whether these words appear in my sentence. If they are misspelled, I want to correct them. I don't intend to touch any words other than those in the given list.

Now I have a list of strings:

s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]

Expected output:

['tiger','lion',None,'dog']

My Efforts:

import difflib

op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)

My Output:

[[], [], [], ['dog']]

The problem with the above code is that it compares the entire sentence, while an entry in my `kw` list can contain more than one word (up to 4-5 words).

If I lower the cutoff value, it starts returning words it should not.

And even if I build bigrams and trigrams from the given sentence, that would consume a lot of time.

So is there a way to implement this?

I have explored a few more libraries like autocorrect, hunspell, etc., but with no success.

– Sociopath

3 Answers


You could implement something based on Levenshtein distance.

It's interesting to note Elasticsearch's implementation: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html

Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters

1 for strings of three, four, or five characters

2 for strings of more than five characters

I like to use pyxDamerauLevenshtein myself.

pip install pyxDamerauLevenshtein

So you could do a simple implementation like:

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        # Edit-distance budget, following Elasticsearch's AUTO fuzziness:
        # 0 for 1-2 characters, 1 for 3-5 characters, 2 for anything longer.
        n = len(word)
        if n < 3:
            budget = 0
        elif n < 6:
            budget = 1
        else:
            budget = 2
        if budget:
            # Replace the word with the first keyword within the budget.
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    new_sentence.append(keyword)
                    break
            else:
                # No keyword was close enough; keep the original word.
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return " ".join(new_sentence)

Just make sure you use a better tokenizer or this will get messy, but you get the point. Also note that this is unoptimized and will be really slow with a lot of keywords. You should implement some kind of bucketing so you don't compare every word against every keyword.
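
One simple form of bucketing (a sketch of my own, not part of the answer above): a single edit changes a string's length by at most one, so a word can only be within the budget of keywords whose length differs from its own by at most that budget. Grouping keywords by length up front skips most comparisons:

from collections import defaultdict

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']

# Group the keywords by length once, up front.
buckets = defaultdict(list)
for keyword in keywords:
    buckets[len(keyword)].append(keyword)


def best_match(word, budget):
    # Only keywords whose length is within `budget` of the word's
    # length can possibly be within `budget` edits of it.
    for length in range(len(word) - budget, len(word) + budget + 1):
        for keyword in buckets.get(length, []):
            if damerau_levenshtein_distance(word, keyword) <= budget:
                return keyword
    return None


print(best_match("tyger", 1))  # tiger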

– PascalVKooten

Here is one way using difflib.SequenceMatcher. The SequenceMatcher class lets you measure the similarity of two strings with its ratio method; you only need to pick a suitable threshold and keep the keywords whose ratio against a given word is above it:

from difflib import SequenceMatcher


def find_similar_word(s, kw, thr=0.5):
    out = []
    for sentence in s:
        found = False
        for word in sentence.split():
            for keyword in kw:
                # ratio() is a similarity score in [0, 1].
                if SequenceMatcher(a=word, b=keyword).ratio() > thr:
                    out.append(keyword)
                    found = True
                    break
            if found:
                break
        else:
            # No word in this sentence matched any keyword.
            out.append(None)
    return out

Output

find_similar_word(s, kw)
['tiger', 'lion', None, 'dog'] 
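
For intuition on the threshold (an illustrative aside, not part of the original answer): ratio() is defined as 2*M/T, where M is the number of matching characters and T is the total length of both strings. That is why "bulldogs" still clears the 0.5 cutoff against 'dog':

from difflib import SequenceMatcher

# "dog" occurs inside "bulldogs", so M = 3 matching characters and
# T = 8 + 3 = 11 total characters: ratio = 2*3/11 ≈ 0.545 > 0.5.
print(SequenceMatcher(a="bulldogs", b="dog").ratio())  # 0.5454...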
– yatu
  • I am afraid this will be too slow. Actually, I am implementing it for the chatbot so speed matters for me. – Sociopath Apr 30 '19 at 17:18

Although this is slightly different from your expected output (it is a list of lists instead of a list of strings), I think it is a step in the right direction. The reason I chose this method is so that you can have multiple corrections per sentence. That is why I added another example sentence.

import difflib
import itertools

kw = ['tiger', 'lion', 'elephant', 'black cat', 'dog']
s = ["I saw a tyger", "There are 2 lyons", "I mispelled Kat", "bulldogs", "A tyger is different from a doog"]

# Match every word of every sentence against the keyword list,
# then flatten the per-word matches into one list per sentence.
op = [[difflib.get_close_matches(j, kw, cutoff=0.5) for j in i.split()] for i in s]
op = [list(itertools.chain(*o)) for o in op]

print(op)

The output it generates is:

[['tiger'], ['lion'], [], ['dog'], ['tiger', 'dog']]

The trick is to split each sentence on whitespace.
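
To also handle multi-word keywords such as 'black cat' (the limitation raised in the comment below), one could match word n-grams up to the longest keyword length instead of single words. A minimal, unoptimized sketch along those lines (my own addition; overlapping n-grams can match the same keyword more than once):

import difflib
import itertools

kw = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def ngrams(words, n):
    # Join every run of n consecutive words into one candidate phrase.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def close_matches(sentence, keywords, cutoff=0.5):
    words = sentence.split()
    # The longest keyword (in words) bounds the n-gram size we need.
    max_n = max(len(k.split()) for k in keywords)
    candidates = itertools.chain.from_iterable(
        ngrams(words, n) for n in range(1, max_n + 1))
    matched = (difflib.get_close_matches(c, keywords, cutoff=cutoff)
               for c in candidates)
    return list(itertools.chain.from_iterable(matched))


print(close_matches("I saw a blak cat", kw))
# finds 'black cat' (possibly more than once, via overlapping n-grams)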

– thomas
  • won't work in my case as my `kw` list may contain more than one word, and if I split on whitespace it won't give the correct result. – Sociopath Apr 30 '19 at 17:25