Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or deletions d) of their correct form. For example, the errors in the word pair absence - absense can be summarized as 1 s, 0 i, and 0 d.
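To make the s/i/d bookkeeping concrete, here is a minimal sketch that counts the three error types for a word pair using a plain Levenshtein table and a backtrace. It does not use the regex module. The function name `edit_counts` is my own, and the direction is an assumption: an insertion means the misspelling has an extra character compared with the correct form. When several minimal edit scripts exist, it reports just one of them.

```python
def edit_counts(correct, misspelled):
    """Return (substitutions, insertions, deletions) for one minimal
    edit script that turns `correct` into `misspelled`."""
    m, n = len(correct), len(misspelled)
    # dist[i][j] = edit distance between correct[:i] and misspelled[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = int(correct[i - 1] != misspelled[j - 1])
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or substitute
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j] + 1,         # deletion
            )

    # Walk back from the bottom-right corner, tallying each operation.
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = int(correct[i - 1] != misspelled[j - 1])
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i -= 1
                j -= 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dels += 1
            i -= 1
    return subs, ins, dels


print(edit_counts("absence", "absense"))  # (1, 0, 0)
```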
One can fuzzy match to find words and their misspellings using the regex Python module (the one intended to eventually replace re).
The following table summarizes attempts made to fuzzy segment a word of interest from some sentence:
- Regex1 finds the best word match in sentence, allowing at most 2 errors.
- Regex2 finds the best word match in sentence, allowing at most 2 errors while trying to operate only on (I think) whole words.
- Regex3 finds the best word match in sentence, allowing at most 2 errors while operating only on (I think) whole words. I'm wrong somehow.
- Regex4 finds the best word match in sentence, allowing at most 2 errors while (I think) looking for the end of the match to be a word boundary.
How would I write a regular expression that eliminates, if possible, false-positive and false-negative fuzzy matches on these word-sentence pairs?
A possible solution would be to only compare words (strings of characters surrounded by white space or the beginning/end of a line) in the sentence to the word of interest (principal word). If there's a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.
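The whole-word idea described above could be sketched roughly as follows. This is not a regex solution: it swaps in a plain Levenshtein distance as a stand-in for the `{e<=2}` constraint. The names `edit_distance` and `fuzzy_whole_words` are hypothetical, and comparing windows of as many tokens as the principal word contains (so that "cub cadet" is compared against two-token windows) is my own assumption, not something stated above.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions, and deletions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_whole_words(word: str, sentence: str, max_errors: int = 2) -> List[str]:
    """Return the whitespace-delimited chunks of `sentence` that are
    within `max_errors` edits of `word`."""
    tokens = sentence.split()
    width = len(word.split())  # assumption: match windows of equal token count
    hits = []
    for start in range(len(tokens) - width + 1):
        candidate = " ".join(tokens[start:start + width])
        if edit_distance(word, candidate) <= max_errors:
            hits.append(candidate)
    return hits


print(fuzzy_whole_words("ryobi", "batteries kyobi"))    # ['kyobi']
print(fuzzy_whole_words("plastex", "vinyl panels"))     # []
print(fuzzy_whole_words("cub cadet", "cub cadet 42"))   # ['cub cadet']
```

Because only whole tokens are compared, a short word like spt can no longer latch onto a fragment of a longer word. The trade-off is that a misspelling containing a stray space, such as traffic mast5er for trafficmaster, is split across two tokens and is still missed by this sketch.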
Code
Copy the following dataframe to your clipboard:
   word           sentence
0  cub cadet      cub cadet 42
1  plastex        vinyl panels
2  spt            heat and air conditioner
3  closetmaid     closetmaid
4  ryobi          batteries kyobi
5  ryobi          10' table saw ryobi
6  trafficmaster  traffic mast5er
Now run the following to load the table into your environment and apply the four patterns:
import pandas as pd
import regex

# Columns in the copied text are separated by two or more spaces.
df = pd.read_clipboard(sep=r'\s\s+')
test = df.copy()  # copy, so the new result columns are not added to df itself
# Regex1
test[r'(?b)(?:WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
# Regex2
test[r'(?b)(?:\wWORD\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:\w' + x['word'] + r'\W){e<=2}', x['sentence']), axis=1)
# Regex3
test[r'(?V1)(?b)(?:\w&&WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
# Regex4
test[r'(?V1)(?b)(?:WORD&&\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:' + x['word'] + r'&&\W){e<=2}', x['sentence']), axis=1)