Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or letter deletions d) of their correct form. For example, the errors in the word pair absence - absense can be summarized as 1 s, 0 i and 0 d.
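As a small illustration of the s/i/d bookkeeping, the `regex` module exposes the error counts of a fuzzy match through the match object's `fuzzy_counts` attribute. The sketch below assumes the third-party `regex` package is installed and uses the absence/absense pair from above; the `(?b)` flag is included so that the lowest-error match is the one reported.

```python
import regex

# (?b) requests the best (lowest-error) match; {e<=2} allows up to 2 errors in total.
match = regex.fullmatch(r'(?b)(?:absence){e<=2}', 'absense')

# fuzzy_counts is a tuple of (substitutions, insertions, deletions).
substitutions, insertions, deletions = match.fuzzy_counts
print(substitutions, insertions, deletions)
```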

One can use fuzzy matching to find words and their misspellings with the regex Python module (the third-party module intended to replace re).

The following table summarizes attempts made to fuzzy segment a word of interest from some sentence:

[Image of the results table not reproduced]

  • Regex1 finds the best word match in the sentence, allowing at most 2 errors
  • Regex2 finds the best word match in the sentence, allowing at most 2 errors, while trying to operate only on (I think) whole words
  • Regex3 finds the best word match in the sentence, allowing at most 2 errors, while operating only on (I think) whole words. I'm wrong somehow.
  • Regex4 finds the best word match in the sentence, allowing at most 2 errors, while (I think) requiring the end of the match to be a word boundary

How would I write a regular expression that eliminates, if possible, false positive and false negative fuzzy matches on these word-sentence pairs?

A possible solution would be to only compare words (strings of characters surrounded by white space or the beginning/end of a line) in the sentence to the word of interest (principal word). If there's a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.
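The word-by-word idea described above can be sketched without any regex at all. This is only an illustration, not the approach the question ends up using: it swaps in a plain Levenshtein distance (standard library only), and the helper names `edit_distance` and `fuzzy_words` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_words(word, sentence, max_errors=2):
    """Return whitespace-delimited words within max_errors of the principal word."""
    return [token for token in sentence.split()
            if edit_distance(word, token) <= max_errors]


print(fuzzy_words('ryobi', 'batteries kyobi'))
print(fuzzy_words('trafficmaster', 'traffic mast5er'))
```

Because the sentence is split on whitespace first, this sketch cannot match a principal word that contains a space (cub cadet) or a misspelling that gained one (traffic mast5er), so it would report false negatives on those two rows of the table.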

Code

Copy the following dataframe to your clipboard:

            word                  sentence
0      cub cadet              cub cadet 42
1        plastex              vinyl panels
2            spt  heat and air conditioner
3     closetmaid                closetmaid
4          ryobi           batteries kyobi
5          ryobi       10' table saw ryobi
6  trafficmaster           traffic mast5er

Now run the following to load the table into your environment and apply the four regexes:

import pandas as pd
import regex

# Columns in the copied table are separated by two or more spaces
df = pd.read_clipboard(sep=r'\s\s+')

# Note: test is another name for df, not a copy
test = df
# Regex1 to Regex4; each column holds the matches found in the sentence
test[r'(?b)(?:WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
test[r'(?b)(?:\wWORD\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:\w' + x['word'] + r'\W){e<=2}', x['sentence']), axis=1)
test[r'(?V1)(?b)(?:\w&&WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
test[r'(?V1)(?b)(?:WORD&&\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:' + x['word'] + r'&&\W){e<=2}', x['sentence']), axis=1)

zelusp
  • Are these inline modifiers `(?V1)`, `(?b)` and what do they mean? –  Apr 25 '16 at 23:24
  • How do you compare a _fuzzy_ word with a real word? Are you using a dictionary of some kind? The easiest way is to split on whitespace and use a custom ternary tree that you write of all the words in a dictionary. As you traverse the tree, you could allow for _N_ letters out of place. You'd need special branching code. –  Apr 25 '16 at 23:27
  • @sln: he is speaking about this module: https://pypi.python.org/pypi/regex – Casimir et Hippolyte Apr 25 '16 at 23:32
  • Yes, `(?V1)` forces regex to use version 1 (instead of 0) of its string matching behavior. `(?b)` is the BESTMATCH flag. It's partially in the linked docs. The *fuzzy* word is simply any character sequence that is within 2 errors of the *principal word* as defined above :) – zelusp Apr 25 '16 at 23:32
  • @CasimiretHippolyte - Yeah, that's what I thought. That module has got too much to comprehend. Even I don't want to parse its constructs. –  Apr 25 '16 at 23:38
  • `(?V1)` is used to enable set operators like `&&` – zelusp Apr 25 '16 at 23:40
  • @zelusp - If that `regex` module can do fuzzy matching with one or two letters missing, make a regex out of a dictionary (or the words you're interested in). _[HERE](http://www.regexformat.com/version_files/Rx5_ScrnSht01.jpg)_ is a tool (functional in the trial version) that does just that. It creates a ternary tree out of dictionary words or strings. Here is a [175,000](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/_ASCII_175,000_word_Mix_A-Z_Multi_Lined.txt) word one. –  Apr 25 '16 at 23:45
  • @zelusp: The && in those regexes isn't a set operator, it's just a literal. Set operators occur only in character sets, between "[" and "]". – MRAB Apr 25 '16 at 23:46
  • regexformat's strings to regex looks awesome! I definitely have an application for that. Here, though, something tells me a solution is a tweak to my regular expressions (as @MRAB points out - thanks!). – zelusp Apr 26 '16 at 00:11
  • Do you want to write a spell checker? – jfs Apr 26 '16 at 06:07
  • The ultimate goal is to count how many times a word (perhaps misspelled) is mentioned in some arbitrary length string. – zelusp Apr 26 '16 at 15:29
  • \w matches 1 word character and \W matches 1 non-word character. They aren't word-boundary checks. \b is the word-boundary check. There's also \m for the start-of-word and \M for the end-of-word boundary checks. You'll probably want to put them outside the fuzzy bit if you want to ensure that they are enforced. – MRAB Apr 26 '16 at 20:19
  • @MRAB, `'(?b)\m(?:WORD){e<=2}\M'` hits the nail on the head (and nowhere else). If you want to answer, I'll count it. Grazie! – zelusp Apr 26 '16 at 21:10
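MRAB's distinction in the comments between \w/\W and \b can be seen with exact (non-fuzzy) patterns. The sketch below uses only the standard-library `re` module, which supports \b but not \m or \M, and the sample strings are made up for illustration.

```python
import re

# \w and \W each consume one character, so the word needs a word character
# before it and a non-word character after it for the pattern to match.
consuming = re.findall(r'\wryobi\W', 'saw ryobi')
glued = re.findall(r'\wryobi\W', 'xryobi!')

# \b is zero-width: it only asserts that a word boundary is present.
boundary = re.findall(r'\bryobi\b', 'saw ryobi')

print(consuming, glued, boundary)
```

In the first case the character before ryobi is a space and nothing follows it, so the \w...\W pattern finds nothing, while the \b version returns the word itself.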

1 Answer

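Use '(?b)\m(?:WORD){e<=2}\M', where WORD is the principal word. As MRAB points out in the comments, \m and \M are the start-of-word and end-of-word checks, and placing them outside the fuzzy group ensures they are always enforced.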
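A minimal sketch of how this pattern might be applied to the word-sentence pairs from the question, assuming the third-party `regex` package is installed. The helper name `fuzzy_whole_word_matches` is hypothetical, and escaping the word with `regex.escape` is an added precaution that is not part of the original pattern.

```python
import regex


def fuzzy_whole_word_matches(word, sentence, max_errors=2):
    """Find whole-word fuzzy matches of word in sentence.

    \\m and \\M sit outside the fuzzy group, so the match must start at the
    start of a word and end at the end of a word.
    """
    pattern = (r'(?b)\m(?:' + regex.escape(word) + r'){e<='
               + str(max_errors) + r'}\M')
    return regex.findall(pattern, sentence)


pairs = [
    ('cub cadet', 'cub cadet 42'),
    ('plastex', 'vinyl panels'),
    ('closetmaid', 'closetmaid'),
    ('ryobi', 'batteries kyobi'),
]
for word, sentence in pairs:
    matches = fuzzy_whole_word_matches(word, sentence)
    # len(matches) gives the number of (possibly misspelled) mentions.
    print(word, '->', matches, len(matches))
```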

zelusp