I want to do the embedded training for speech recognition. In the beginning, I want to use the monophone with 3-states, as the paper decripted, I can concatenate all the phones in one word or sentence to make a composited hmm model, and do embedded training on the composited hmm model.
like this picture:
I am puzzled when I try to do this, there are some problems puzzled me.
3-states phone model also have other 2 states, begin state and end state, generally the transition only allowed happend on self state to self state and self to next state. So what is the transition should be from one phone end state to next phone start state? Or the start and end state should be ignored when concatenating? I found HTK cookbook direct to concatenate the start and end state, but set transition from one phone end state to next phone start state to 1.0.
One word may contains a phone in many times, how to concatenate the same phone in A(transition) matrix and B(emission) matrix? I understand to concatenate the A matrix directly with the transition from one phone end state to next phone start state 1.0. The B matrix was shared, the same phone state has the same emission distribution.
I have may different words to train. After I trained one word, I got the three parameters: A matrix, B matrix, and Pi for the phones in the word, how can I use these parameters to train another word?