I am trying to write a spellchecker and want to use difflib to implement it. I have a list of technical terms that I added to the standard Unix dictionary (/usr/share/dict/words), and I store the combined list in a file called dictionaryFile.py.

I have another script, stringSim.py, where I import the dictionary and test sample strings against it. Here is the basic code:

import difflib
import time
from dictionaryFile import wordList

inputString = "dictiunary"
print("Search query: " + inputString)
startTime = time.time()

for term in inputString.split():
    termL = term.lower()
    print("Search term: " + term)
    # get_close_matches returns the best candidates first; the list may be empty
    closeMatches = difflib.get_close_matches(termL, wordList)
    if closeMatches and closeMatches[0] == termL:
        print("Perfect Match")
    elif closeMatches:
        print("Possible Matches")
        print("\n".join(closeMatches))
    else:
        print("No Matches")

print("%s seconds" % (time.time() - startTime))

It returns the following:

$ python stringSim.py 
Search query: dictiunary
Search term: dictiunary
Possible Matches
dictionary
dictionary's
discretionary
0.492614984512 seconds

I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList variable (right now it is just a list of words)?

Thanks.

user3058197
    Since you have working code and are looking for improvements, you may have better results on Code Review – wnnmaw May 16 '14 at 17:58

1 Answer

I'm not sure difflib is the best solution for this kind of work; spellcheckers typically use some sort of edit distance, e.g. Levenshtein distance. NLTK includes an implementation of edit distance, so I'd start there instead.
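
To make the edit-distance idea concrete, here is a minimal pure-Python sketch rather than NLTK's own code (NLTK exposes a comparable function as nltk.edit_distance, as far as I know). The names edit_distance, suggest, max_distance, limit and sample_words are hypothetical and chosen only for illustration. The set lookup for exact matches and the length filter are assumptions about what might help speed things up, not measured results.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(term, words, max_distance=2, limit=3):
    """Return up to `limit` words within `max_distance` edits of `term`."""
    term = term.lower()
    word_set = set(words)  # assumption: in a real app, build this once at startup
    if term in word_set:
        return [term]  # exact match, no fuzzy search needed
    candidates = []
    for word in words:
        # A length gap larger than max_distance already rules the word out.
        if abs(len(word) - len(term)) > max_distance:
            continue
        distance = edit_distance(term, word)
        if distance <= max_distance:
            candidates.append((distance, word))
    candidates.sort()  # closest first, ties broken alphabetically
    return [word for _, word in candidates[:limit]]


# Hypothetical stand-in for wordList from dictionaryFile.py
sample_words = ["dictionary", "dictionary's", "discretionary", "diction"]
print(suggest("dictiunary", sample_words))
```

With max_distance=2, only "dictionary" (one substitution away) is kept for the sample query; raising the threshold widens the candidate list at the cost of more distance computations.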

LetMeSOThat4U