I'm trying to take large amounts of natural-language text from a web forum and correct the spelling with PyEnchant. The text is often informal and about medical issues, so I have created a text file, "test.pwl", containing relevant medical words, chat abbreviations, and so on. In some cases, bits of HTML, URLs, etc. unfortunately remain in the text.

My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and replace each one, fully automatically, with the first suggestion from d.suggest. It prints a list of misspelled words, then the words that had no suggestions, and writes the corrected text to 'spellfixed.txt':

import enchant
import codecs

def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US","test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        #get suggestions
        suggestions=d.suggest(mspellword)
        #make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion=suggestions[0]
        else: print mspellword
        #replace every occurence of the bad word with the suggestion
        #this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword,picksuggestion)
    try:
        fo=open("spellfixed.txt","w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return 
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return

The issue is that the output often contains 'corrections' for words that were in either the dictionary or the PWL. For instance, when the first portion of the input was:

My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else

I got this:

My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else

I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the above quotation is the entire input), the result is what I want:

My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else

Could anybody explain to me why? In very simple terms, please, as I'm very new to programming and newer to Python. A step-by-step solution would be greatly appreciated.

user2437842

2 Answers

I think your problem is that you're replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that doesn't mean that you should change "considered" to "considERed".

You can use regexes instead of simple text replacement to ensure that you replace only full words. "\b" in a regex means "word boundary":

>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
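
For the loop in the question, the same idea could be wrapped in a small helper. This is only a sketch: `replace_whole_word` is a hypothetical name, not part of PyEnchant or `re`. It escapes the misspelled word so that characters such as "." or "+" are matched literally, and it passes the replacement through a function so that any backslashes in the suggestion are inserted as-is.

```python
import re

def replace_whole_word(text, bad_word, good_word):
    # Hypothetical helper: replace bad_word only where it stands alone
    # as a whole word, never inside a longer word.
    pattern = r"\b" + re.escape(bad_word) + r"\b"
    # A function replacement is inserted literally, so backslashes in
    # good_word are not treated as regex escapes.
    return re.sub(pattern, lambda match: good_word, text, flags=re.UNICODE)
```

In the question's script, the line `textdata = textdata.replace(mspellword,picksuggestion)` would then become `textdata = replace_whole_word(textdata, mspellword, picksuggestion)`. Note that "\b" only matches next to a letter, digit, or underscore, so a token from `split()` that ends in punctuation (such as "else,") may not match at all.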
svk
  • Thanks mate. I do know regexes, but being so new to programming and Python, I would have no idea how to implement the word boundary delimiters within my code. ... Clue? – user2437842 Jan 16 '14 at 13:00
  • Am I getting somewhere by doing: textdata = textdata.replace("\\b" + mspellword + "\\b","\\b" + picksuggestion + "\\b") – user2437842 Jan 16 '14 at 13:47
  • @user2437842 Not quite, you need to use a regex function like `re.sub` instead of string `replace`. See the code snippet in my answer as well as the [documentation](http://docs.python.org/2/library/re.html#re.sub). You can construct the regex as `"\\b" + re.escape( mspellword ) + "\\b"`. The text you want to insert as a replacement (`picksuggestion`) should not be converted into a regex. – svk Jan 16 '14 at 14:08
1
    #replace every occurence of the bad word with the suggestion
    #this is almost certainly a bad idea :)

You were right, that is a bad idea. This is what's causing "considered" to be replaced by "considERed". Also, you're doing a replacement even when you don't find a suggestion, in which case picksuggestion still holds the suggestion from the previous word. Move the replacement into the `if len(suggestions) > 0` block.
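
A minimal sketch of that second fix, under some assumptions: `fix_words` is a hypothetical function, and its `suggest` argument stands in for `d.suggest` from the question. The replacement happens only inside the branch where a suggestion exists, and it uses a whole-word regex rather than `str.replace` so that letters inside longer words are left alone.

```python
import re

def fix_words(text, misspelled, suggest):
    # Hypothetical helper; 'suggest' is any function that returns a
    # list of suggestions for a word, such as d.suggest in the question.
    no_suggestions = []
    for word in misspelled:
        suggestions = suggest(word)
        if len(suggestions) > 0:
            # Replace only when we actually have a suggestion.
            pattern = r"\b" + re.escape(word) + r"\b"
            first = suggestions[0]
            text = re.sub(pattern, lambda match, s=first: s, text)
        else:
            # Leave the word untouched and remember it for reporting.
            no_suggestions.append(word)
    return text, no_suggestions
```

Called as `fix_words(textdata, mispelled, d.suggest)`, this would return the corrected text together with the list of words that had no suggestions.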

As for replacing every instance of the word: instead, save the position of each misspelled word along with its text (or just the positions, and look the words up in the text later when you fetch suggestions), allow duplicates in the list of misspelled words, and replace only that individual occurrence with its suggestion.

I'll leave the implementation details and optimizations up to you, though. A step-by-step solution won't help you learn as much.
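
For readers who want a starting point, here is one possible sketch of the position-based approach. It is not the only way to do it. `correct_by_position` is a hypothetical name, `check` and `suggest` stand in for `d.check` and `d.suggest`, and treating every run of non-whitespace characters as a word is an assumption that mirrors the question's use of `split()`.

```python
import re

def correct_by_position(text, check, suggest):
    # Collect a (word, start_position) tuple for every token that
    # fails the spell check. Duplicates are kept on purpose.
    misspelled = [(m.group(), m.start())
                  for m in re.finditer(r"\S+", text)
                  if not check(m.group())]

    pieces = []
    last = 0  # index just past the previous replaced word
    for word, start in misspelled:
        suggestions = suggest(word)
        # Keep the original word when there is no suggestion.
        replacement = suggestions[0] if suggestions else word
        pieces.append(text[last:start])  # untouched text before the word
        pieces.append(replacement)
        last = start + len(word)
    pieces.append(text[last:])  # untouched text after the last word
    return "".join(pieces)
```

Because each replacement is applied at one recorded position, correcting "doc" can never alter the "doc" inside "doctor".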

Eric Finn
  • I appreciate your wisdom and spirit. Thanks. I wish my dad were a bit more like you sometimes. That said, I'm afraid what you've described is totally out of my reach. Saving word positions and allowing duplicates are things I've never even dreamt of. I'm happy to 'work for it', but I think I need a push start. Be a pal? – user2437842 Jan 16 '14 at 13:02
  • @user2437842 Heh, fair enough. Look into tuples. A tuple of `(word, position)` should work. – Eric Finn Jan 16 '14 at 13:39