
When you misspell a word in Google ("appples", for example), it responds with the now-familiar "Did you mean: apples" suggestion.

Excluding Google's ability to guess your intentions based on relevance of search results, how can I develop a list of words that sound the same?

The words do not have to be English, and they do not have to exist. For example, given the input "hole", I would get back a list including words like "whole", "hola", "whore", "role", "molar", etc.

I am guessing there might be something online that can generate this list, but I couldn't find anything. If there is no such site, and if it can be done in Perl, is there a CPAN module that can help me do this?

CheeseConQueso
  • If you can break the words into phonemes then it becomes a most common substrings problem. Breaking words into phonemes is a seriously hard problem though. – Flexo Feb 01 '12 at 21:07

2 Answers


If you are truly looking for words that sound the same, and not just search suggestions, you can look at phonetic algorithms. Soundex and Metaphone/Double Metaphone are two very common ones, and there are implementations of each in most popular languages.

These algorithms reduce a word to a "key" that represents its pronunciation. If you start with a corpus of words and build a data structure mapping each key to the words that evaluate to it (probably a hash table of lists or similar), you can take an arbitrary string, reduce it to its key, and look up the other words stored under that same key.

This isn't perfect, because you'd need to find a big corpus of words to seed your dataset with, but it would work.
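The key-to-words lookup described above could be sketched roughly as follows. This is only an illustration in Python, assuming the common "American Soundex" rules; the function names `soundex`, `build_index`, and `sounds_like` and the tiny word list are made up for the example, and a real setup would load a large word list instead.

```python
from collections import defaultdict

# Consonant classes used by American Soundex; vowels, y, h and w get no digit.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word, length=4):
    """Return the Soundex key of `word` (first letter plus digits)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "0" * length)[:length]  # pad or truncate to fixed length


def build_index(words):
    """Map each Soundex key to the list of words that produce it."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


def sounds_like(word, index):
    """Look up other words stored under the same key as `word`."""
    return [w for w in index.get(soundex(word), []) if w != word]


# Hypothetical seed corpus, just for demonstration.
corpus = ["whole", "hola", "role", "hello", "hall", "molar", "apples"]
index = build_index(corpus)
print(sounds_like("hole", index))
```

Note that Soundex keeps the first letter as part of the key, so with this particular algorithm "whole" and "role" land under different keys than "hole". Metaphone-style algorithms treat leading letters differently and may group such words more the way you expect.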

On the other hand, if you simply want search suggestions/alternate spellings there are easier ways to go about it.

Hope that was helpful.

acoffman
  • Thanks for the lead on Soundex. I actually found that it's a built-in function in Oracle and PHP, and probably many other languages. I don't understand the data it returns, though. See the examples here: http://www.techonthenet.com/oracle/functions/soundex.php "apples" returns "A142" and "applus" returns "A142" also. What does "A142" mean? – CheeseConQueso Feb 01 '12 at 21:21
  • @CheeseConQueso The combination of letters and numbers that is returned doesn't necessarily have meaning by itself. What the algorithm does is reduce words to those keys, so two words that evaluate to the same key have similar pronunciations. That's why, to do what you're suggesting with a phonetic algorithm, you'd have to build a searchable datastore of key -> (list of words that evaluate to that key). When you get "apples", you run it through your algorithm, get "A142", and then search your datastore for words that also evaluate to "A142". That help? – acoffman Feb 01 '12 at 21:26
  • Oh... any idea where to find a Soundex table that I can import into a table in my DB? If not, what kind of keywords should I be feeding Google to find more info? Thanks for your help – CheeseConQueso Feb 01 '12 at 22:23
  • That's something I'm not 100% sure of. When we used it at work, we found several English-language word lists and dictionaries and built our own mapping – acoffman Feb 01 '12 at 22:26

You can start by learning about the module Text::Soundex. It implements a simple algorithm that maps words to 4-character codes (a letter followed by three digits). I got Soundex out of Sedgewick (ex Knuth) long ago, used it to generate longer keys (not truncated), and suggested lists of corrections for 0- and 1-letter substitutions. I applied this to large databases of census and postal data.
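One possible reading of the "longer keys plus 0- and 1-letter substitutions" idea is sketched below in Python. It is a guess at the approach, not the original code: `soundex_key` and `suggest` are hypothetical names, the untruncated key is simply Soundex without padding or cutting to four characters, and a candidate is suggested only if it has the same key and differs from the input in at most one letter position.

```python
# Consonant classes used by American Soundex; other letters get no digit.
CODES = {ch: d for d, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                "4": "l", "5": "mn", "6": "r"}.items()
         for ch in group}


def soundex_key(word, length=None):
    """Soundex key; with length=None the key is left untruncated."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    if length is None:
        return key  # longer, untruncated key
    return (key + "0" * length)[:length]


def suggest(word, corpus, max_subs=1):
    """Words with the same untruncated key and at most max_subs substitutions."""
    key = soundex_key(word)
    found = []
    for cand in corpus:
        if len(cand) != len(word):
            continue  # substitutions only: lengths must match
        diff = sum(a != b for a, b in zip(word.lower(), cand.lower()))
        if diff <= max_subs and soundex_key(cand) == key:
            found.append(cand)
    return found


# Hypothetical word list, just for demonstration.
print(suggest("applus", ["apples", "apple", "applus", "ankles"]))
```

Untruncated keys separate long words that would otherwise collide once cut to four characters, which may matter for large name or address datasets like the ones mentioned above.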

Erik Olson