
I am designing a text-processing program that will generate a list of keywords from a long itemized text document and combine entries for words that are similar in meaning. There are similarity metrics out there, but I now have a new issue: dealing with words that are not in the dictionary I am using.

I am currently using nltk and Python, but my issue here is of a much more abstract nature. Given a word that is not in the dictionary, what would be an efficient way of resolving it to a word that is? My only current solution involves running through all the words in the dictionary and picking the one with the shortest Levenshtein (edit) distance from the input word.
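
A minimal sketch of that brute-force approach (nltk.edit_distance is nltk's Levenshtein implementation):

import nltk

def resolve_naive(word, dictionary):
    # One edit-distance computation per dictionary entry: O(len(dictionary))
    # per unknown word, which is what makes this impractical.
    return min(dictionary, key=lambda w: nltk.edit_distance(word, w))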

Obviously this is very slow and impractical, and I don't actually need the absolute best match from the dictionary, just as long as the result is a contained word and reasonably close. Efficiency matters more to me here, but a basic level of accuracy is also needed.

Any ideas on how to generally resolve some unknown word to a known one in a dictionary?

Slater Victoroff
  • Check out [BK Trees](http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees) or [Levenshtein automata](http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata). – Michael Mior Jun 13 '12 at 18:51
  • While those are both very cool and definitely on the right track, I'm looking for something that will not reject any words, as Levenshtein automata do, and doesn't require the large amount of storage that BK-trees do. – Slater Victoroff Jun 13 '12 at 19:02
  • Does this mean you want to find a match for words an arbitrary distance away from your dictionary? (e.g. looking for `the` in `supercalifragilisticexpialidocious`, `pneumonoultramicroscopicsilicovolcanoconiosis` should still give a result?) – Michael Mior Jun 13 '12 at 19:06
  • As silly as it sounds, yes. I may well be dealing with words common within my working set that are very far from my dictionary, and I would like them to resolve to the same word, though exactly which word they resolve to is not very important. Is there any way to work backwards from an aggressive stemmer like Lancaster? – Slater Victoroff Jun 13 '12 at 19:09
  • Is it important that you get a meaningful result, or simply that a mapping exists and is consistent? That is, could you use Levenshtein distance up to some limiting value and then an arbitrary mapping after that? Perhaps when constructing a Levenshtein automata, you could truncate the input dictionary? – Michael Mior Jun 13 '12 at 19:43
  • A meaningful result would be nice, but is definitely secondary to a consistent mapping that exists. I may be misinterpreting your answer, but how would I intelligently truncate the input dictionary, or limit the Levenshtein distance without actually calculating each distance? – Slater Victoroff Jun 13 '12 at 19:51
  • I realized that what I was thinking in my head won't actually work, so unfortunately, it makes sense that my comment would be confusing. Apologies. – Michael Mior Jun 13 '12 at 20:03
  • Heh, well thanks for thinking about it at least. – Slater Victoroff Jun 13 '12 at 20:13
  • If you are looking for words with similar meanings, then you really need a thesaurus rather than Levenshtein distance. Alternatively, Soundex might be faster than Levenshtein, at least for the first pass. – rossum Jun 14 '12 at 11:22
  • Soundex sounds like a good start, but synonym finding is a lot more computationally intensive than distance finding. Also, is there anything about Soundex dictionary lookups that would make them faster than a standard lookup? My intuition says that the speed lost through processing everything through Soundex, and mapping it back, would be more than the speed gained by the slightly smaller number of search terms, but if the lookup is faster it would definitely make up for it. – Slater Victoroff Jun 14 '12 at 13:46
  • Soundex: Whenever a word is stored, store its Soundex code as a secondary index. If a new word doesn't match, retrieve all the words with the same Soundex code and go through them looking at Levenshtein distance. Not perfect, but likely to be a lot quicker than computing the Levenshtein distance for every word in the dictionary. Soundex lets you reduce the number of words that need the heavy-duty calculation. – rossum Jun 14 '12 at 16:24
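
A minimal sketch of rossum's two-pass scheme, with hand-rolled Soundex and Levenshtein implementations so it stands alone (the sample word list is illustrative only):

from collections import defaultdict

def soundex(word):
    # Classic four-character Soundex code: first letter plus three digits.
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2',
             'q': '2', 's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3', 'l': '4',
             'm': '5', 'n': '5', 'r': '6'}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0], '')
    for ch in word[1:]:
        digit = codes.get(ch, '')
        if digit and digit != prev:
            result += digit
        if ch not in 'hw':  # h and w do not separate repeated codes
            prev = digit
    return (result + '000')[:4]

def levenshtein(a, b):
    # Standard dynamic-programming edit distance, two rows at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

dictionary = ['truck', 'bus', 'bush', 'keyword']  # illustrative word list
index = defaultdict(list)
for w in dictionary:
    index[soundex(w)].append(w)   # secondary index built once, up front

def resolve(word):
    # Heavy-duty Levenshtein runs only over the (small) Soundex bucket,
    # falling back to the full dictionary when the bucket is empty.
    bucket = index.get(soundex(word)) or dictionary
    return min(bucket, key=lambda w: levenshtein(word, w))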

4 Answers


It looks like you need a spelling corrector to match words in your dictionary. The code below (lightly updated for Python 3) is taken from Peter Norvig's post http://norvig.com/spell-correct.html:

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word frequencies, smoothing unseen words to a count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit operation away from word.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Words two edits away that actually appear in the dictionary.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then known words one edit away, then two,
    # breaking ties by corpus frequency.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Here big.txt is your dictionary: a large text of known words, whose frequencies drive the ranking in correct().
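
For example, with Norvig's own big.txt as the training text (outputs depend on the corpus frequencies):

correct('speling')    # -> 'spelling'
correct('korrecter')  # -> 'corrector'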

Salil Navgire

Your task sounds like it's really just non-word spelling correction, so a relatively straightforward solution would be to use an existing spell checker like aspell with a custom dictionary.

A quick-and-dirty approach would be to use a phonetic mapping like metaphone (which is one of the algorithms used by aspell). For each possible code derived from your dictionary, choose a representative word (e.g., the most frequent word in the group) to suggest as the correction, and pick a default correction for the case where no matches are found. But you'd probably get better results using aspell.
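
A minimal sketch of that mapping, assuming the third-party jellyfish package for the metaphone codes and an illustrative frequency dict (in practice you would count words in your own corpus):

import jellyfish  # third-party: pip install jellyfish

word_freq = {'keyword': 120, 'cord': 40, 'chord': 15}  # illustrative counts

# Map each metaphone code to the most frequent word in its group.
representatives = {}
for word, freq in word_freq.items():
    code = jellyfish.metaphone(word)
    best = representatives.get(code)
    if best is None or freq > word_freq[best]:
        representatives[code] = word

DEFAULT = 'UNKNOWN_WORD'  # fixed fallback when no code matches

def phonetic_correct(word):
    return representatives.get(jellyfish.metaphone(word), DEFAULT)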

If you do want to calculate edit distances, you can do it relatively quickly by storing the dictionary and the possible edit operations in tries; see Brill and Moore (2000). If you have a decent-sized corpus of spelling errors and their corrections and can implement Brill and Moore's whole approach, you would probably beat aspell by quite a bit, but it sounds like aspell (or any spell checker that lets you create your own dictionary) is sufficient for your task.
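
Brill and Moore's full model needs a corpus of errors, but the basic trick of sharing edit-distance work across dictionary words with a common prefix can be sketched with a plain trie (this is the generic row-sharing traversal, not their improved error model):

def build_trie(words):
    # Nested-dict trie; '$' marks a complete word.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w
    return root

def search(trie, word, max_dist):
    # Return (word, distance) pairs within max_dist. Each trie edge extends
    # one DP row, so shared prefixes are only ever computed once.
    results = []
    first_row = list(range(len(word) + 1))

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for i in range(1, len(word) + 1):
            row.append(min(row[i - 1] + 1,                          # insertion
                           prev_row[i] + 1,                         # deletion
                           prev_row[i - 1] + (word[i - 1] != ch)))  # substitution
        if '$' in node and row[-1] <= max_dist:
            results.append((node['$'], row[-1]))
        if min(row) <= max_dist:  # prune branches that cannot recover
            for next_ch, child in node.items():
                if next_ch != '$':
                    walk(child, next_ch, row)

    for ch, child in trie.items():
        if ch != '$':
            walk(child, ch, first_row)
    return results

# usage (illustrative):
# trie = build_trie(dictionary)
# search(trie, 'keywrd', max_dist=2)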

aab
  • My task is very different from spelling corrections. I picked Levenshtein distance just because it was a consistent mapping that would be likely to preserve some meaning. I will also be dealing with some symbolic writing, and so a spell checker would simply not work. – Slater Victoroff Jun 14 '12 at 13:32
  • Can you provide an example? I don't know what you mean by "symbolic writing". If you have some definition of "word" and are comparing an unknown word to a list of words in a dictionary (both of which hopefully use the same finite set of symbols), I don't see why spell-checking algorithms would not be helpful. – aab Jun 14 '12 at 14:18
  • Symbolic writing such as "a^2 + 2b + c". A spell-checking algorithm would be useful, but only for a small subset of the cases I would encounter; I also need to be able to map words that are simply not in the dictionary. – Slater Victoroff Jun 14 '12 at 15:15
  • Well, either you have a dictionary that lists all the valid words/tokens or you have a much more complicated problem. Non-word spell checking always maps words that are not in the dictionary to words that are in the dictionary. What would you want to map "a^2 + 2b + c" to in your dictionary? I think you need to provide a better description of your problem or more examples... – aab Jun 18 '12 at 09:18
  • I am aware that it is a more complicated problem than spell checking, which is why I said that. What "a^2 + 2b + c" gets mapped to is immaterial so long as it is consistent. All of this has been posted already; you need only read it. – Slater Victoroff Jun 28 '12 at 04:01
  • You say in your question: "Any ideas on how to generally resolve some unknown word to a known one in a dictionary?" This task is non-word spell checking. That may not be what you want to do, but then you need to edit your question. – aab Jul 03 '12 at 13:20

Hopefully this answer is not too vague:

1) It sounds like you might need to look at your tokenisation and phrase-chunking layers first. This is where you should discard symbolic phrase chunks before submitting them to any fuzzy spell checking.

2) I would still recommend edit distance to come up with alternatives to any 'mis-spelt' tokens after that, but this may return a list of equally close candidates.

3) When you have your list of candidates, you can then use co-occurrence statistics to select the most likely one. I only have a Java example of some software that could help (http://www.linguatools.de/disco/disco_en.html#was). You can submit a word, and it will return the definitive co-occurring words for that word. You can then compare this list to the context of your 'mis-spelt' word, and select the candidate with the most overlap.
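
A minimal Python sketch of that selection step (cooccurring is an assumed helper that returns the set of words typically co-occurring with its argument, e.g. pulled from DISCO or from counts over your own corpus):

def pick_by_context(candidates, context, cooccurring):
    # candidates: equally close words from the edit-distance step
    # context: tokens surrounding the 'mis-spelt' word in the document
    context = set(context)
    return max(candidates, key=lambda c: len(cooccurring(c) & context))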

Laurence
  • Also, have you looked at non-dictionary-based methods (such as the DISCO software) for finding similar words in the first place? – Laurence Mar 07 '13 at 15:01

I do not see a reason to use Levenshtein distance to find a word similar in meaning. LD looks at form, not meaning (you want to map "bus" to "truck", not to "bush").

The correct solution depends on what you want to do next.

Unless you really need the information in those unknown words, I would simply map all of them to a single generic "UNKNOWN_WORD" item.

Obviously you can cluster the unknown words by their context and other features (say, whether they start with a capital letter). For context clustering: since you are interested in meaning, I would use a larger window for those words (say +/- 50 words) and probably a simple bag-of-words model. You then find the known word whose vector in this space is closest to the unknown word's, using some distance metric (say, cosine). Let me know if you need more information about this.
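
A minimal sketch of that last step, using collections.Counter bags over a +/- 50-word window and cosine similarity:

import math
from collections import Counter

def context_vector(target, tokens, window=50):
    # Bag of words over every +/- window occurrence of target in the corpus.
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            vec.update(tokens[max(0, i - window):i])
            vec.update(tokens[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def resolve_by_context(unknown, known_words, tokens):
    # Map the unknown word to the known word with the most similar contexts.
    target = context_vector(unknown, tokens)
    return max(known_words, key=lambda w: cosine(context_vector(w, tokens), target))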

Jirka