The closest existing question I found is this one (simple speech recognition methods), but since three years have passed and its answers are not sufficient, I will ask again.
I want to build, from scratch, a simple speech recognition system that only needs to recognize five words. As far as I know, the most commonly used audio features for this task are MFCCs, with HMMs for classification.
I am able to extract MFCCs from audio, but I still have doubts about how to use these features to train an HMM and then perform classification.
As I understand it, I first have to perform vector quantization: collect a large set of MFCC vectors, run a clustering algorithm (e.g. k-means) on them to obtain centroids, and then quantize by labeling every MFCC vector with the index of its nearest centroid.
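To make my understanding of the quantization step concrete, here is a sketch of what I have in mind (pure NumPy; the naive k-means loop, the codebook size `K`, and all variable names are my own assumptions, not from any particular reference):

```python
import numpy as np

def kmeans(vectors, K, n_iters=50, seed=0):
    """Naive k-means over MFCC vectors: returns a (K, D) codebook of centroids."""
    rng = np.random.default_rng(seed)
    # initialize centroids from K randomly chosen training vectors
    centroids = vectors[rng.choice(len(vectors), K, replace=False)]
    for _ in range(n_iters):
        # assign each vector to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the vectors assigned to it
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = vectors[labels == k].mean(axis=0)
    return centroids

def quantize(mfcc_frames, centroids):
    """Map each MFCC frame to the index of its nearest centroid (the 'symbol')."""
    dists = np.linalg.norm(mfcc_frames[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

So each utterance would end up as a 1-D sequence of integer symbols, one per frame.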
The centroid indices then become the 'observable symbols' of the HMM. I feed the training utterances of each word to the training algorithm and create one HMM per word. Given an audio query, I score it against all models and output the word whose model assigns it the highest probability.
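My mental model of the scoring step is the sketch below: the forward algorithm for a discrete-emission HMM, evaluated in log space, with an argmax over per-word models. The model parameters here are made up for illustration; in practice I assume each word's `(pi, A, B)` would come from Baum-Welch training on that word's symbol sequences.

```python
import numpy as np

def forward_loglik(obs, log_pi, log_A, log_B):
    """Log-likelihood of a symbol sequence under a discrete HMM
    (forward algorithm in log space).
    Shapes: log_pi (N,), log_A (N, N), log_B (N, M) for N states, M symbols."""
    alpha = log_pi + log_B[:, obs[0]]
    for t in range(1, len(obs)):
        # sum (logsumexp) over previous states, then emit symbol obs[t]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return np.logaddexp.reduce(alpha)

def classify(obs, word_models):
    """Return the word whose HMM gives the query the highest log-likelihood.
    word_models maps word -> (log_pi, log_A, log_B)."""
    scores = {w: forward_loglik(obs, *m) for w, m in word_models.items()}
    return max(scores, key=scores.get)
```

Is this the right overall shape of the system?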
First of all, is this procedure correct? Second, how do I deal with words of different durations? For example, if my training words last 500 ms and 300 ms, how many observable symbols do I feed to each model when comparing?
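To make the duration concern concrete, here is the frame-count arithmetic I have in mind, assuming (my assumption, not a given) a 25 ms analysis window with a 10 ms hop at 16 kHz:

```python
# Hypothetical framing parameters: 25 ms window, 10 ms hop, 16 kHz audio.
sr = 16000
win = int(0.025 * sr)   # 400 samples per frame
hop = int(0.010 * sr)   # 160 samples between frame starts

def n_symbols(duration_ms):
    """Number of frames (i.e. observable symbols) for an utterance of this length."""
    n_samples = int(duration_ms / 1000 * sr)
    return 1 + (n_samples - win) // hop

len_500ms = n_symbols(500)  # symbol count for a 500 ms word
len_300ms = n_symbols(300)  # symbol count for a 300 ms word
```

So a 500 ms word and a 300 ms word produce symbol sequences of different lengths, and it is unclear to me whether the models can be compared on sequences of different lengths.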
Note: I don't want to use Sphinx, the Android API, the Microsoft API, or any other speech library.
Note 2: I would also appreciate pointers to more recent and better techniques.