What is the next procedure after creating a CMUSphinx language model with my own dictionary?

Question

I have created my own CMUSphinx language model for Arabic, for software that will listen to a user and apply commands. I wrote the dictionary manually and converted the ARPA language model to the DMP format using the command sphinx_lm_convert -i ar.lm -o ar.lm.dmp. Here are the files that I have so far:

.txt (the commands text file)
.wfreq (word frequency file)
.idngram (ngram file)
.dic (dictionary file)
.phone (phonemes file)
.lm (arpa language model file)
.lm.dmp (binary DMP trigram language model file)

I then recorded myself saying each word. Each word has its own .wav file, and they are all in one folder that is separate from the folder containing the .dic, .txt and .lm files.

My question is: what is the next step? I was reading the tutorial here: http://cmusphinx.sourceforge.net/wiki/tutorial

It says that adapting an existing acoustic model is the next step after building the language model. Isn't it training the language model?

And if it is training, I have all the required files except:

.transcription
.fileids

What should be inside these two files?

Thanks

score 1 · Accepted Answer · answered Dec 29 '15 at 08:25

The procedure for training an acoustic model is described in the tutorial for Acoustic Model Training.

You need to create the fileids and transcription files manually in a text editor, or with a script if you want to convert an existing transcription from a custom form into the required format.

The fileids file must list the file names; the transcription file must list the transcription for each of those files in a special format.

For an example of an acoustic model training database, you can look inside the an4 database.
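
As a rough illustration of the "with a script" option, here is a minimal Python sketch. It assumes the setup described in the question: recordings named 1.wav, 2.wav, ... where recording N corresponds to line N of the commands text file. The function names, the `subdir` parameter and the `db_name` prefix are hypothetical; the line layouts follow the an4 files (one path without the .wav extension per line in fileids, and `<s> text </s> (utterance_id)` per line in transcription), so check them against your own copy of an4 before relying on this.

```python
from pathlib import Path


def build_training_lists(commands, subdir=""):
    """Return (fileids_lines, transcription_lines) for numbered recordings.

    Assumption: line N (1-based) of the commands file was recorded as N.wav.
    Blank lines are skipped, but they still consume a number.
    """
    fileids = []
    transcriptions = []
    for index, text in enumerate(commands, start=1):
        text = text.strip()
        if not text:
            continue
        utt_id = str(index)
        # fileids: path relative to the wav directory, without the extension
        fileids.append(f"{subdir}/{utt_id}" if subdir else utt_id)
        # transcription: text between <s> and </s>, then the utterance id
        transcriptions.append(f"<s> {text} </s> ({utt_id})")
    return fileids, transcriptions


def write_training_lists(commands_path, out_dir, db_name, subdir=""):
    """Write <db_name>_train.fileids and <db_name>_train.transcription."""
    lines = Path(commands_path).read_text(encoding="utf-8").splitlines()
    fileids, transcriptions = build_training_lists(lines, subdir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{db_name}_train.fileids").write_text(
        "\n".join(fileids) + "\n", encoding="utf-8")
    (out / f"{db_name}_train.transcription").write_text(
        "\n".join(transcriptions) + "\n", encoding="utf-8")
```

For example, `write_training_lists("ar.txt", "etc", "ar")` would write `etc/ar_train.fileids` and `etc/ar_train.transcription` (all three arguments are placeholder names). Every word used in the transcription should also appear in the .dic file.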

Nikolay Shmyrev

So I looked at the an4 database as you said, but I'm a little confused about the transcription and fileids files. I found two fileids files and two transcription files: for transcription I found `an4_train.transcription` and `an4_test.transcription`, and for fileids I found `an4_train.fileids` and `an4_test.fileids`. I opened those files to see whether they have the same contents, but they do not. Could you please explain why they are different, and would it be a problem if they were the same? I have only one folder of wav files, named in sequence: `1.wav`, `2.wav`, `3.wav`, etc. – 0x01Brain Dec 29 '15 at 21:21
Each of these files corresponds to one line in the txt file (the commands text file). – 0x01Brain Dec 29 '15 at 21:27
