How can I add new words or vocabulary to the Kaldi platform?

Question

I am trying to build an ASR system using the pre-trained models that are available as samples. I am stuck on how to add new words to a trained model so that it recognizes them correctly the next time; I imagine this involves some machine learning technique. Any ideas would be helpful.

score 2 · Answer 1 · answered Dec 07 '16 at 09:55

There are two things you might need:

Lexicon: Find a file such as lexicon.txt in your data folder and add your words with their corresponding phone sequences, one pronunciation per line, like:
```
speech s p iy ch
the dh ax
the dh iy
```
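If you have more than a handful of words, a small script may be easier than editing by hand. Here is a minimal sketch, assuming a plain-text lexicon with one "word phone phone ..." entry per line; the function name add_lexicon_entries is made up for illustration and is not part of Kaldi. Keep in mind that a pre-trained acoustic model can only use phones from the phone set it was trained with, so the pronunciations have to be written with those phones.

```python
from pathlib import Path


def add_lexicon_entries(lexicon_path, new_entries):
    """Merge (word, phones) pairs into a lexicon file and return its lines.

    Hypothetical helper, not a Kaldi tool. A word may occur on several
    lines, one line per pronunciation.
    """
    path = Path(lexicon_path)
    entries = set()
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                # Normalize tabs and repeated spaces to single spaces.
                entries.add(" ".join(line.split()))
    for word, phones in new_entries:
        entries.add(f"{word} {' '.join(phones)}")
    ordered = sorted(entries)  # duplicates were already dropped by the set
    path.write_text("\n".join(ordered) + "\n", encoding="utf-8")
    return ordered
```

Calling add_lexicon_entries("lexicon.txt", [("the", ["dh", "iy"])]) would add the second pronunciation of "the" from the example above if it is not already there.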
Language Model: Find a file such as XXX.lm in your data folder, add your word to the 1-gram section with a log probability, and update the "ngram 1=" count to match, like:
```
\data\
ngram 1=200
ngram 2=4000
...

\1-grams:
-7.3241 the
...
```
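The manual edit can also be scripted. Below is a rough sketch, assuming a standard ARPA file with a "\1-grams:" header line and an "ngram 1=" count line; add_unigram is a hypothetical helper, not an SRILM or Kaldi function. It appends the word at the end of the unigram section and bumps the count. It does not renormalize the other probabilities, so the distribution will be slightly off; for anything beyond a few words, re-estimating the model is probably safer.

```python
import re


def add_unigram(arpa_text, word, log10_prob):
    """Return ARPA text with `word` added as a unigram (hypothetical helper).

    The new entry has no backoff weight, which the ARPA format treats
    as a backoff of 0.0.
    """
    lines = arpa_text.splitlines()
    header = [i for i, ln in enumerate(lines) if ln.strip() == "\\1-grams:"]
    if not header:
        raise ValueError("no \\1-grams: section found")
    start = header[0]
    # The unigram section ends at the next line starting with a backslash.
    end = next((i for i in range(start + 1, len(lines))
                if lines[i].startswith("\\")), len(lines))
    for ln in lines[start + 1:end]:
        fields = ln.split()
        if len(fields) >= 2 and fields[1] == word:
            return arpa_text  # word is already in the vocabulary
    # Insert after the last non-blank unigram line.
    insert_at = end
    while insert_at > start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1
    lines.insert(insert_at, f"{log10_prob:.4f}\t{word}")
    # Keep the declared unigram count consistent with the section.
    for i, ln in enumerate(lines):
        match = re.fullmatch(r"\s*ngram 1=(\d+)\s*", ln)
        if match:
            lines[i] = f"ngram 1={int(match.group(1)) + 1}"
            break
    return "\n".join(lines) + "\n"
```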

After this, rebuild the decoding graph HCLG.fst from these two updated files.
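In a typical Kaldi recipe directory the rebuild goes through the utils/ scripts. The sketch below only assembles the command lines so you can inspect them before running anything. All paths and the "<UNK>" out-of-vocabulary entry are placeholders I made up; the argument order follows my reading of each script's usage message, so please check it against your Kaldi checkout. Some model types need extra mkgraph.sh options that are not shown here, and the phone table of the new lang directory has to match the one the pre-trained model was trained with.

```python
import subprocess


def graph_rebuild_commands(dict_dir, arpa_gz, model_dir,
                           oov="<UNK>", lang_dir="data/lang_new"):
    """Build the command lines for regenerating HCLG.fst (all paths hypothetical)."""
    lang_test = lang_dir + "_test"
    return [
        # 1. Lexicon directory -> lang directory (L.fst, words.txt, phones.txt).
        ["utils/prepare_lang.sh", dict_dir, oov, "data/local/lang_tmp", lang_dir],
        # 2. Gzipped ARPA language model -> G.fst inside a test lang directory.
        ["utils/format_lm.sh", lang_dir, arpa_gz,
         f"{dict_dir}/lexicon.txt", lang_test],
        # 3. Compose everything into the decoding graph HCLG.fst.
        ["utils/mkgraph.sh", lang_test, model_dir, f"{model_dir}/graph_new"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, print the result of graph_rebuild_commands("data/local/dict", "data/local/lm/new.arpa.gz", "exp/model") first, and call run_all on it from the recipe directory once the paths look right.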

Note: The probabilities in the language model affect the recognition results, so you need to choose a sensible value, or use the SRILM toolkit to estimate the language model from the text of your corpus.
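To get a feeling for what a "sensible value" looks like, you can estimate unigram log10 probabilities from relative word frequencies in your own text. This is only a rough sketch: it uses plain maximum-likelihood estimates with no smoothing, which is not what SRILM does (SRILM applies discounting and computes backoff weights), and the function name unigram_log10_probs is made up for illustration.

```python
import math
from collections import Counter


def unigram_log10_probs(sentences):
    """Map each word to log10(count / total) over whitespace-split sentences."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    total = sum(counts.values())
    return {word: math.log10(count / total) for word, count in counts.items()}
```

A word that makes up one in ten million tokens would come out at about -7, which is the order of magnitude of the entry shown in the example above. A new word that you expect to be spoken often should get a value closer to zero.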

This answer is the right way to go, any idea on how to add unigram to ARPA file? Manually? — xtluo, Nov 27 '19 at 09:45
