I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!

I'm basically trying to create a classifier that takes word comparison outputs (e.g. Levenshtein distances) and decides whether two words are sufficiently different. An important input would probably be something like a Soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names and I need to mutate them slightly, based on similar sounds within the word.

E.g. John to Jon is the kind of mutation I want to generate, and I could label that pair in the test set as equivalent. John and Johann differ enough in both sound and letter distance to be considered different.

So what I'm asking for is a way to build a phoneme variation generator that retains the English lettering structure.

Even simple translation might suffice, like "f" could (sometimes) be replaced by "ph". I'm doing this in Java so any tips in that direction would be great too! Thanks.

EDIT

This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf

Manish Patel
  • Have you tried simply using edit distance? Note that ed(john, jon) = 1 whereas ed(john, johann) = 2. – Debasis Aug 13 '14 at 15:32
  • thanks @Debasis, sorry what do you mean exactly? I'm trying to generate variations rather than find out the edit distance – Manish Patel Aug 13 '14 at 15:46
  • Sorry, I misunderstood your question. I thought you intended to compute phonetic distances, which is why I suggested edit distances as "crude" approximations to real phonetic distances. – Debasis Aug 13 '14 at 18:38

1 Answer

I'm just thinking aloud.

Rule-based: Apply a rule-based system with standard substitution rules, such as 'ph' for 'f', and insertion rules, such as inserting an 'h' between a vowel and a consonant.
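To make the rule-based idea concrete, here is a minimal Java sketch that applies one substitution rule at one position and collects the resulting spellings. The class name `VariantGenerator` and the rule table are hypothetical and purely illustrative, not a vetted phonetic table; only substitution rules are shown, and insertion rules would need a separate step.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class VariantGenerator {
    // Hypothetical substitution rules (assumption: chosen only for illustration).
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ph", "f");
        RULES.put("f", "ph");
        RULES.put("hn", "n");   // e.g. John -> Jon
        RULES.put("ck", "k");
    }

    /** Returns every spelling reachable by applying one rule at one position. */
    public static Set<String> variants(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            String from = rule.getKey();
            String to = rule.getValue();
            // Try the rule at each position where its left-hand side occurs.
            int idx = lower.indexOf(from);
            while (idx >= 0) {
                out.add(lower.substring(0, idx) + to + lower.substring(idx + from.length()));
                idx = lower.indexOf(from, idx + 1);
            }
        }
        out.remove(lower); // never report the unchanged name as a variant
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("John"));    // [jon]
        System.out.println(variants("Stephen")); // [stefen]
    }
}
```

Each (name, variant) pair produced this way could be labelled "equivalent" for the training set. Applying only one rule per variant keeps the mutations small; chaining rules would give more variants at the cost of more implausible spellings.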

Character n-gram alignment: Use a word alignment tool such as GIZA++ to align character n-grams from parallel corpora such as Europarl. I would guess you could find interesting spelling variations this way, such as "house" and "haus". You can experiment with different values of n.
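As a rough sketch of the preprocessing this would need, the Java below splits a word into overlapping character n-grams and joins them with spaces, so that each n-gram looks like a token to a word-level aligner. The class and method names are hypothetical, and the space-separated layout is an assumption; the exact input format an aligner expects is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {
    /** Splits a word into overlapping character n-grams. */
    public static List<String> ngrams(String word, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of width n across the word, one character at a time.
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Joins the n-grams with spaces so each one reads as a separate token. */
    public static String asTokenLine(String word, int n) {
        return String.join(" ", ngrams(word, n));
    }

    public static void main(String[] args) {
        System.out.println(asTokenLine("house", 2)); // ho ou us se
        System.out.println(asTokenLine("haus", 2));  // ha au us
    }
}
```

Words shorter than n yield no n-grams, so small values of n are probably the more useful ones for short names.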

Bootstrapping character n-gram alignment with rules: You might also want to combine the two, in which case you could, in principle, boost the probabilities of some alignments using a set of external rules and heuristics.

Debasis
  • +1 for n-gram alignment, looks promising. I was hoping this was a "done" thing and that there are some well defined algorithms to do this but it's not looking positive at the moment. Will report back. – Manish Patel Aug 13 '14 at 20:49