I am trying to write a spellchecker and I wanted to use difflib to implement it. Basically I have a list of technical terms that I added to the standard unix dictionary (/usr/share/dict/words
) that I'm storing in a file I call dictionaryFile.py.
I have another script just called stringSim.py where I import the dictionary and test sample strings against it. Here is the basic code:
import os, sys
import difflib
import time
from dictionaryFile import wordList
inputString = "dictiunary"
print "Search query: "+inputString
startTime = time.time()
inputStringSplit = inputString.split()
for term in inputStringSplit:
termL = term.lower()
print "Search term: "+term
closeMatches = difflib.get_close_matches(termL,wordList)
if closeMatches[0] == termL:
print "Perfect Match"
else:
print "Possible Matches"
print "\n".join(closeMatches)
print time.time() - startTime, "seconds"
It returns the following:
$ python stringSim.py
Search query: dictiunary
Search term: dictiunary
Possible Matches
dictionary
dictionary's
discretionary
0.492614984512 seconds
I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList variable (right now it is just a list of words)?
Thanks.