Machine learning challenge: learn english pronunciation

Question

Say you want to take CMU's phonetic data set input that looks like this:

ABERRATION  AE2 B ER0 EY1 SH AH0 N
ABERRATIONAL  AE2 B ER0 EY1 SH AH0 N AH0 L
ABERRATIONS  AE2 B ER0 EY1 SH AH0 N Z
ABERT  AE1 B ER0 T
ABET  AH0 B EH1 T
ABETTED  AH0 B EH1 T IH0 D
ABETTING  AH0 B EH1 T IH0 NG
ABEX  EY1 B EH0 K S
ABEYANCE  AH0 B EY1 AH0 N S

(The word is to the left, to the right are a series of phonemes, key here)

And you want to use it as training data for a machine learning system that would take new words and guess how they would be pronounced in English.

It's not so obvious to me at least because there isn't a fixed token size of letters which could possible map to a phoneme. I have a feeling that something to do with a markov chain might be the right way to go.

How would you do this?

One thing to keep in mind is that both the CMU and moby data are for American pronunciation and don't have a very good set of phonemes for British or other English varieties. In fact even the CMU and moby data have different sets of phonemes. The moby pronunciator is here: http://icon.shef.ac.uk/Moby/mpron.html — hippietrail, May 09 '11 at 04:38

score 6 · Accepted Answer · answered Apr 05 '09 at 23:07

6

The problem is called Grapheme-to-phoneme conversion, a subproblem of Natural Language Processing. Google brings up a few papers.

answered Apr 05 '09 at 23:07

Frank

64,140
93
237
324

score 2 · Answer 2 · answered Mar 23 '09 at 14:58

2

Not entirely my field, but maybe build a neural network with several layers - earlier layers to guess the splitting of the words into sequential syllables, the later layers to guess the pronounciation of the said syllables.

Setting up a ANFIS-learning neural network is fairly straightforward for numerical data, for literal/phonetic data the task is undoubtedly several orders more complex.

answered Mar 23 '09 at 14:58

Jukka Dahlbom

1,740
2
16
23

can you really have an NN with a variable number of output nodes? – ʞɔıu Mar 24 '09 at 01:50
I believe so - quick googling suggests it is easier to train the networks separately and then combine to achieve several outputs. This problem is far from trivial, and I don't claim to being able to really solve it. – Jukka Dahlbom Mar 24 '09 at 08:22
Would you really need a variable number of output nodes? Unless the number of phonemes is prohibitively large, just have as many output nodes as possible phonemes. – bubaker May 20 '09 at 01:53

Machine learning challenge: learn english pronunciation

2 Answers2