
The closest question I could find to mine is this one (simple speech recognition methods), but since three years have passed and its answers are not sufficient, I will ask again.

I want to build, from scratch, a simple speech recognition system; I only need to recognize five words. As far as I know, the most widely used audio features for this application are MFCCs, with HMMs for classification.

I'm able to extract MFCCs from audio, but I still have some doubts about how to use those features to build an HMM model and then perform classification.

As I understand it, I have to perform vector quantization. First I need to collect a set of MFCC vectors, then apply a clustering algorithm to obtain centroids. The centroids are then used to quantize: every MFCC vector is compared against the centroids and labeled with the index of the closest one.
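Something like this is what I have in mind for the quantization step (a rough sketch using scipy just to keep it short; a hand-rolled k-means would work the same way, and `mfcc_frames` and the codebook size of 64 are only placeholders):

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

# Pool MFCC frames from all training recordings: shape (n_frames, n_coeffs).
mfcc_frames = np.random.randn(5000, 13)             # placeholder for real MFCCs

# Cluster to get the codebook (centroids), then label each frame with the
# index of its nearest centroid. Those indices are the discrete symbols.
codebook, _ = kmeans2(mfcc_frames, 64, minit='++')
symbols, _ = vq(mfcc_frames, codebook)               # symbols in [0, 63]
```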

The centroids then become the 'observable symbols' of the HMM. I feed recorded words to the training algorithm and create an HMM model for each word. Given an audio query, I score it against all the models and report the word whose model gives the highest probability.
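In code, the classification step I have in mind looks roughly like this (my own sketch, not any library's API; `word_models` is a hypothetical dict mapping each word to its trained initial/transition/emission matrices, and `symbols` is the quantized query from the previous sketch):

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.
    obs: symbol indices; pi: (S,) initial probs; A: (S, S) transitions; B: (S, K) emissions."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()                  # rescale each step to avoid underflow
        log_lik += np.log(c)
        alpha = alpha / c
    return log_lik

# word_models: {"one": (pi, A, B), "two": (pi, A, B), ...}   (hypothetical)
# best = max(word_models, key=lambda w: forward_log_likelihood(symbols, *word_models[w]))
```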

First of all, is this procedure correct? Second, how do I deal with words of different lengths? I mean, if I have trained words of 500 ms and 300 ms, how many observable symbols do I feed in when comparing against all the models?

Note: I don't want to use Sphinx, the Android API, the Microsoft API, or any other library.

Note 2: I would appreciate it if you could share more recent information on better techniques.

jessica

1 Answer


First of all, is this procedure correct?

The vector quantization part is OK, but it is rarely used these days. You are describing so-called discrete HMMs, which nobody uses for speech anymore. If you use continuous HMMs with a GMM as the emission probability distribution, you don't need vector quantization at all.
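To make "GMM emissions" concrete: instead of a lookup table B[state, symbol] over VQ symbols, each state scores a raw MFCC frame by its log-density under that state's Gaussian mixture. A rough sketch with placeholder (untrained) mixture parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_emission_logprob(frame, weights, means, diag_covs):
    """Log-density of one MFCC frame under a diagonal-covariance GMM.
    weights: (M,), means: (M, D), diag_covs: (M, D)."""
    comps = [np.log(w) + multivariate_normal.logpdf(frame, mean=m, cov=np.diag(c))
             for w, m, c in zip(weights, means, diag_covs)]
    return np.logaddexp.reduce(comps)    # log-sum-exp over mixture components
```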

Also, you focused on the less important steps, like MFCC extraction, but skipped the most important parts: HMM training with Baum-Welch and HMM decoding with Viterbi, which are a far more complex part of the training than the initial estimation of the states with vector quantization.
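For reference, Viterbi decoding itself is short once the model is trained; it is the Baum-Welch re-estimation that takes most of the code. A minimal log-space sketch for a discrete HMM, assuming the same pi/A/B convention as above (zero-probability transitions simply become -inf in log space):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for a discrete HMM, computed in log space."""
    T, S = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # (from_state, to_state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]                # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta.max())
```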

Then, how do I deal with words of different lengths? I mean, if I have trained words of 500 ms and 300 ms, how many observable symbols do I feed in when comparing against all the models?

When you decode speech, you usually select states that correspond to the parts of phonemes perceived by a human. It is traditional to take 3 states per phoneme. For example, the word "one" would have 9 states for its 3 phonemes, and the word "seven" would have 15 states for its 5 phonemes. This practice has proven to be effective. Of course, you can vary this estimate slightly.
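In code, that rule just fixes the size and topology of each word's HMM. A sketch of building a left-to-right transition matrix from the phoneme count (the 0.6 self-loop probability is only an initial guess; training re-estimates it):

```python
import numpy as np

def left_to_right_transitions(n_phonemes, states_per_phoneme=3, self_loop=0.6):
    n_states = n_phonemes * states_per_phoneme
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s + 1 < n_states:
            A[s, s] = self_loop              # staying in a state absorbs duration differences
            A[s, s + 1] = 1.0 - self_loop    # moving on advances through the word
        else:
            A[s, s] = 1.0                    # final state
    return A

A_one = left_to_right_transitions(3)    # "one":   3 phonemes ->  9 states
A_seven = left_to_right_transitions(5)  # "seven": 5 phonemes -> 15 states
```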

Nikolay Shmyrev
  • Another question: let's suppose I have a model for the word "loan" and another model for the word "loaner". If I am capturing audio, how do I know when to stop searching for a word? My algorithm could say that I have the word "loan" when in reality the word is going to be "loaner" but it hasn't finished yet. – jessica May 15 '14 at 17:58
  • Yes, you can't determine the result from the beginning; you need to analyze the audio ahead to figure out what was said. For a dynamic search one can use "backtracking" to find the stable part of the search. You also need to employ context: you can estimate the probability of 'loaner' from the previous words. This is a high-level description of the algorithms and methods. – Nikolay Shmyrev May 15 '14 at 18:18
  • In my current problem, I only have to recognize the phrase "shut up", so I think I'm only going to generate one model with 8 states (including the brief silence). Can the model itself characterize possible shifts in time, for example if the person says "shuuut up"? – jessica May 15 '14 at 18:25
  • The HMM allows time scaling, so shifts in time are handled. Besides the 8 states for the phrase and silence, you need a garbage state to cover all other speech as an alternative; otherwise your model will always recognize the word (see the sketch after these comments). – Nikolay Shmyrev May 15 '14 at 19:00
  • A garbage state is a new concept to me. To train that way, should I train on a lot of the different variants that can occur when saying "shut up", such as "shutSILENCEup", "shutSMALLNOISEup", "shutINTENSENOISEup"? Or is it easier to train both words separately, like one model for "shut" and another for "up"? – jessica May 15 '14 at 19:07
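As a rough illustration of the garbage/filler idea mentioned in the comments: score the utterance under the keyword model and under a simple background model, and only accept the keyword if it wins by a margin. `forward_log_likelihood` is the sketch from earlier; the models and threshold here are hypothetical:

```python
def detect_keyword(obs, keyword_model, garbage_model, margin=0.0):
    """Accept the keyword only if it beats the garbage/filler model by `margin`."""
    kw = forward_log_likelihood(obs, *keyword_model)    # "shut up" model (8 states + silence)
    bg = forward_log_likelihood(obs, *garbage_model)    # e.g. a small broad model of other speech
    return (kw - bg) > margin
```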