Understanding variables from speech recognition paper in HMM-GMM

Question

I am reading this paper by Mark Gales and Steve Young on speech recognition using HMM-GMM. On page 205, in the second paragraph, it says:

"For each utterance Y^(r) , r = 1, . . . , R, of length T^(r) the sequence of baseforms, the HMMs that correspond to the word-sequence in the utterance, is found and the corresponding composite HMM constructed"

I did not clearly understand what Y^(r) and T^(r) are. Can someone clarify? I also did not understand what r and R stand for.

Similarly, in the paper titled "A Parallel Implementation of Viterbi Training for Acoustic Models using Graphics Processing Units", in section 2.1 the author mentions that:

"Given a set of training observations O^(r) , 1 ≤ r ≤ R and an HMM state sequence 1 < j < N the observations sequence is aligned to the state sequence via Viterbi alignment."

I know both sentences are similar, but in this second paper as well I did not understand what r and R are.

score 0 · Answer 1 · answered Jul 23 '20 at 10:38

In HMMs, you have time-sequential observations. Speech recognition is a special task since the observation length is not fixed but variable.

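As far as I understand, Y^(r) is the r-th training utterance, i.e. one complete observation sequence. Writing its individual frames as y_1, y_2, ...:

Y^(r) = {y_1, y_2, ..., y_T^(r)}, with r = 1, ..., R.

In this case, r indexes the utterances in the training set, R is the total number of training utterances, and T^(r) is the length (number of frames) of utterance r, which differs from utterance to utterance. O^(r) in the second paper plays the same role as Y^(r).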
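
To make the indexing in the quoted sentences concrete, here is a minimal sketch with made-up data. It assumes, as "For each utterance Y^(r), r = 1, ..., R, of length T^(r)" suggests, that r runs over the training utterances. The feature dimension, the lengths, and the helper names (make_utterance, describe) are hypothetical and not taken from either paper.

```python
import random

random.seed(0)

D = 13                   # assumed feature dimension per frame (illustrative only)
lengths = [50, 72, 31]   # made-up values of T^(r) for r = 1, 2, 3


def make_utterance(num_frames, dim):
    """Return one toy utterance: a list of num_frames feature vectors."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_frames)]


# The training set: utterances[r - 1] plays the role of Y^(r).
utterances = [make_utterance(T, D) for T in lengths]

# R is the number of training utterances, not a frame count.
R = len(utterances)


def describe(utts):
    """Return (r, T^(r)) pairs, with r counted from 1 as in the paper."""
    return [(r, len(Y_r)) for r, Y_r in enumerate(utts, start=1)]


for r, T_r in describe(utterances):
    print(f"r = {r} of R = {R}: Y^({r}) has T^({r}) = {T_r} frames")
```

Each utterance has its own length, which is why the length carries the superscript (r).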
