How to generate homophones on substring level?

Question

I want to generate homophones of words programmatically. Meaning, words that sound similar to the original words.

I've come across the Soundex algorithm, but it just replaces some characters with other characters (like t instead of d). Are there any lists or algorithms that are a little bit more sophisticated, providing at least homophone substrings?
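For reference, here is a minimal sketch of classic American Soundex, assuming the standard letter-to-digit table. It shows why Soundex does not help with generation: it collapses a word into a short key that can be compared, but it gives no way to produce an alternative spelling. The function name `soundex` is just an illustrative choice.

```python
# Letter groups of classic American Soundex; vowels and y/h/w carry no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word (letters only)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return ("".join(key) + "000")[:4]


print(soundex("Google"), soundex("Gugel"))  # both map to G240
```

Both spellings receive the same key, so Soundex can confirm that two strings sound alike, but it cannot propose "Gugel" given only "Google".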

Important: I want to apply this on words that aren't in dictionaries, meaning that I can't rely on whole, real words.

EDIT:

The input is a string which is often a proper name and therefore not in any standard (homophone) dictionary. An example could be Google or McDonald's (just to name two popular named entities; many are far less well known).

The output is then a (random) homophone of this string. Since words often have more than one homophone, a single (random) one is my goal. In the case of Google, a homophone could be gugel, or MacDonald's for McDonald's.

Pedantry: technically, MacDonald's is a proper name too, but I suspect "hardies" for Hardee's (another fast food chain) or "heralds" for Harold's (Chicken Shack) is along the lines of what you want. — aschultz, Jul 29 '19 at 03:12

score 1 · Answer 1 · answered Nov 17 '17 at 23:57

How to do this well is a research topic. See for example http://www.inf.ufpr.br/didonet/articles/2014_FPSS.pdf.

But suppose that you want to roll your own.

The first step is figuring out how to turn the letters that you are given into a representation of what they sound like. This is a very hard problem that requires guessing. (E.g., what sound does "read" make? It depends on whether you are going to read, or you already read!) However, text to phonemes converter suggests that ARPAbet has solved this for English.

Next you'll want this to have been done for every word in a dictionary. Assuming that you can do that for one word, that's just a script.
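As a rough sketch of that script, assuming you already have a word-to-phonemes mapping: the tiny `PRONUNCIATIONS` table below is hand-written sample data in ARPAbet (stress markers omitted), standing in for a real pronunciation dictionary. The helper names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample data; a real run would load a full pronunciation
# dictionary instead. A word may have several pronunciations ("read").
PRONUNCIATIONS = {
    "read": [("R", "IY", "D"), ("R", "EH", "D")],
    "reed": [("R", "IY", "D")],
    "red": [("R", "EH", "D")],
    "knight": [("N", "AY", "T")],
    "night": [("N", "AY", "T")],
}


def build_sound_index(pronunciations):
    """Map each phoneme sequence to the set of words pronounced that way."""
    index = defaultdict(set)
    for word, variants in pronunciations.items():
        for phonemes in variants:
            index[phonemes].add(word)
    return index


def homophones(word, pronunciations, index):
    """Return dictionary words sharing at least one pronunciation with word."""
    found = set()
    for phonemes in pronunciations.get(word, []):
        found |= index.get(phonemes, set())
    found.discard(word)
    return sorted(found)


index = build_sound_index(PRONUNCIATIONS)
print(homophones("read", PRONUNCIATIONS, index))   # ['red', 'reed']
print(homophones("night", PRONUNCIATIONS, index))  # ['knight']
```

Note that this only ever returns words that are already in the table, which is the limitation discussed in the comments below.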

Then you'll want it stored in a data structure where you can easily find similar sounds. That is in principle no different from the sort of algorithms that are used for spelling autocorrect, only with phonemes instead of letters. You can get a sense of how to do that with http://norvig.com/spell-correct.html. Or try to implement something like what is described in http://fastss.csg.uzh.ch/ifi-2007.02.pdf.
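To make the autocorrect analogy concrete, here is a minimal sketch in the spirit of the Norvig article, with single edits applied to phoneme tuples rather than letters. The `SOUND_INDEX` table is hand-written sample data and the function names are hypothetical. It is a brute-force illustration, not the FastSS method from the second link.

```python
# Sample index from ARPAbet phoneme sequences (stress omitted) to words.
SOUND_INDEX = {
    ("R", "IY", "D"): {"read", "reed"},
    ("R", "EH", "D"): {"read", "red"},
    ("R", "OW", "D"): {"road", "rode"},
    ("B", "R", "EH", "D"): {"bread", "bred"},
    ("N", "AY", "T"): {"knight", "night"},
}


def edits1(phonemes, alphabet):
    """All phoneme sequences one delete, replace, or insert away."""
    splits = [(phonemes[:i], phonemes[i:]) for i in range(len(phonemes) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + (p,) + right[1:]
                for left, right in splits if right for p in alphabet}
    inserts = {left + (p,) + right for left, right in splits for p in alphabet}
    return deletes | replaces | inserts


def similar_sounding(phonemes, index):
    """Words whose pronunciation is within one phoneme edit of the input."""
    alphabet = {p for key in index for p in key}
    found = set(index.get(phonemes, set()))
    for candidate in edits1(phonemes, alphabet):
        found |= index.get(candidate, set())
    return sorted(found)


print(similar_sounding(("R", "EH", "D"), SOUND_INDEX))
# ['bread', 'bred', 'read', 'red', 'reed', 'road', 'rode']
```

A real implementation would probably weight edits by how close two phonemes sound, since swapping one vowel for a near neighbour is a smaller change than dropping a consonant.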

And that is it.

Where I see the problem is that my dictionary would not contain words like macdonald's or gugel - and therefore wouldn't be considered to be homophones, am I right? To build the dictionary I would need to know the possible homophones beforehand. It's different from autocorrection, because I want to go from Google to Gugel instead of Gugel to Google. — ScientiaEtVeritas, Nov 18 '17 at 00:14
Right. You would need all of the possible answers to have been available ahead for this approach. — btilly, Nov 18 '17 at 00:24
