I am trying to understand the internal workings of Kaldi, but I am having trouble understanding the technical details of Kaldi's documentation.

I want to have a high-level understanding of the various objects first in order to help digest what is presented. Specifically, I would like to know what the .tree, final.mdl, and HCLG.fst files are, what is needed to generate them, and how they are used.

Vaguely, I understand the following (please correct me if I am wrong):

  • final.mdl is the acoustic model and contains the probability of transitioning from one phone to another.
  • HCLG.fst is a graph that, given a sequence of phones, will generate the most likely word sequence based on the lexicon, grammar, and language model.
  • "Decoding graph" is the term for the process of generating HCLG.fst.
  • I am not quite sure what adding a self-loop is; is it similar to the Kleene operator?
  • A lattice contains alternative word sequences for an utterance.

I understand there is a lot to cover but any help is appreciated!

– kkawabat

2 Answers

You'd better ask one question at a time. Also, it is better to read the book first to understand the theory, rather than trying to grasp everything at once.

final.mdl is the acoustic model and contains the probability of transitioning from one phone to another

The main component of the acoustic model final.mdl is the set of acoustic detectors, not transition probabilities. It is either a set of GMMs for the phones or a neural network. The acoustic model also contains the transition probabilities from one HMM state to another, which build the HMM for a single phone. The transition probabilities between phones are encoded in the graph HCLG.fst.
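
To make that split concrete, here is a minimal Python sketch of the two things an acoustic model like final.mdl bundles together: per-state acoustic detectors (a GMM here, though it could equally be a neural network) and within-phone HMM transition probabilities. This is purely illustrative; the class names and layout are my own, not Kaldi's actual classes or file format.

    import numpy as np

    class DiagGMM:
        """Toy diagonal-covariance GMM: the acoustic detector for one HMM state."""
        def __init__(self, weights, means, variances):
            self.weights = np.asarray(weights)      # (num_components,)
            self.means = np.asarray(means)          # (num_components, dim)
            self.variances = np.asarray(variances)  # (num_components, dim)

        def log_likelihood(self, frame):
            # log p(frame | state), log-sum-exp over the mixture components.
            diff = frame - self.means
            comp = -0.5 * np.sum(diff**2 / self.variances
                                 + np.log(2 * np.pi * self.variances), axis=1)
            comp += np.log(self.weights)
            m = comp.max()
            return m + np.log(np.exp(comp - m).sum())

    class PhoneHMM:
        """One phone: a few emitting states plus transitions between them."""
        def __init__(self, state_gmms, transitions):
            self.state_gmms = state_gmms    # one DiagGMM per state
            self.transitions = transitions  # transitions[i][j] = P(state i -> state j)

    gmm = DiagGMM([0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
    print(gmm.log_likelihood(np.array([0.5, 0.5])))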

HCLG.fst is a graph that, given a sequence of phones, will generate the most likely word sequence based on the lexicon, grammar, and language model.

Not quite. HCLG.fst is a finite state transducer that gives you the probability of a state sequence based on the lexicon and language model. Phone sequences are not really used in the graph; they are accounted for during graph construction.
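
A hedged toy example of what "transducer" means here (the labels and weights below are made up, not a real HCLG): input labels are HMM-state-level symbols, output labels are words, and the arc weights, expressed as negative log probabilities, already fold in the lexicon and language model.

    import math

    # arcs: state -> list of (input_label, output_label, weight, next_state)
    arcs = {
        0: [("s1", "<eps>", 0.1, 1)],
        1: [("s1", "<eps>", 0.7, 1),   # self-loop: stay in the same HMM state
            ("s2", "hello", 0.4, 2)],
        2: [("s2", "<eps>", 0.7, 2),
            ("s3", "<eps>", 0.3, 3)],
    }
    final_states = {3: 0.0}

    def path_cost(state_seq, start=0):
        """Total -log probability of one input (state) sequence, plus its words."""
        cost, state, words = 0.0, start, []
        for label in state_seq:
            for in_lab, out_lab, weight, nxt in arcs.get(state, []):
                if in_lab == label:
                    cost += weight
                    if out_lab != "<eps>":
                        words.append(out_lab)
                    state = nxt
                    break
            else:
                return math.inf, []   # no matching arc: sequence not in the graph
        return cost + final_states.get(state, math.inf), words

    print(path_cost(["s1", "s1", "s2", "s2", "s3"]))  # (2.2, ['hello'])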

I am not quite sure what adding a self-loop is; is it similar to the Kleene operator?

A speech HMM has a self-loop on every state, which allows the state to last for several input frames. You can find the HMM topology in the book to see the loops.
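
In FST terms, a self-loop is indeed close to a Kleene star over a single symbol: the state may repeat any number of times. The short sketch below (with an illustrative, made-up loop probability) shows the practical effect: a geometric distribution over how many frames a single HMM state occupies.

    self_loop_prob = 0.7  # hypothetical probability of staying in the state
    for d in range(1, 6):
        # Stay d-1 times, then leave once.
        p = self_loop_prob ** (d - 1) * (1 - self_loop_prob)
        print(f"P(state lasts exactly {d} frames) = {p:.3f}")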

A lattice contains alternative word sequences for an utterance.

This is correct, but it also contains times and acoustic and language model scores.
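
As a hedged illustration (the field names here are hypothetical, not Kaldi's actual lattice format), each competing hypothesis in a lattice carries a word identity, a time span, and separate acoustic and language-model scores that can be reweighted later:

    from dataclasses import dataclass

    @dataclass
    class LatticeArc:
        word: str
        start_frame: int
        end_frame: int
        acoustic_score: float   # -log acoustic likelihood
        lm_score: float         # -log language-model probability

    # Two alternative hypotheses for the same stretch of audio:
    path_a = [LatticeArc("hello", 0, 45, 312.4, 2.1),
              LatticeArc("world", 45, 90, 298.7, 3.4)]
    path_b = [LatticeArc("yellow", 0, 45, 318.9, 5.6),
              LatticeArc("world", 45, 90, 298.7, 3.4)]

    def total_cost(path, lm_weight=10.0):
        # Keeping the scores separate lets you rescore with a new LM weight
        # without re-decoding the audio.
        return sum(a.acoustic_score + lm_weight * a.lm_score for a in path)

    print(total_cost(path_a), total_cost(path_b))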

– Nikolay Shmyrev
  • Where is it written that you can ask only one question at a time? Why do you reply if you have a problem with somebody asking multiple questions? – LITDataScience Apr 26 '22 at 06:32

But how are the transition probabilities for HCLG (namely those in "H" and "C") estimated? I get that, since G is simply a language model, the probabilities of transitions between words can be estimated from a corpus. But I do not understand how the transition probabilities for "H" (the transducer that converts HMM states into context-dependent phones) can be estimated if I have a DNN acoustic model trained on the alignments of a GMM-HMM, since the output of the DNN is a softmax with the emission probabilities. Are the transition probabilities simply taken from the GMM-HMM model, or are they updated during training like the emission probabilities?
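
For context on the softmax point raised above: in a conventional hybrid DNN-HMM setup, the softmax posteriors are converted into scaled likelihoods for decoding by dividing by the state priors, since log p(x|s) = log P(s|x) - log P(s) + const, and the constant cancels when comparing paths. A minimal sketch of that conversion (my own illustration with made-up numbers, not Kaldi code):

    import numpy as np

    def pseudo_log_likelihoods(log_posteriors, log_state_priors):
        """Convert DNN softmax outputs into acoustic scores for HMM decoding."""
        # log p(x|s) = log P(s|x) - log P(s), up to a path-independent constant.
        return log_posteriors - log_state_priors

    log_post = np.log(np.array([0.70, 0.20, 0.10]))   # softmax over 3 states, one frame
    log_prior = np.log(np.array([0.50, 0.30, 0.20]))  # priors counted from alignments
    print(pseudo_log_likelihoods(log_post, log_prior))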

– John Doe