Sphinx4 requires the audio used to train the acoustic model to be segmented into utterances of 5 to 30 seconds each. Why? And how do you segment the audio? When should you cut at 5 seconds, at 10 seconds, or at 25 seconds? Thank you, dear sir!
2 Answers
1
SphinxTrain aligns the transcript text to the audio during training: it tries to match phonemes to the individual pieces of audio. When an utterance is long, a good match is harder to get because there are too many possible alignments and more room for mistakes. For that reason it is better to keep to the recommended utterance length.
When you segment the audio, split on silence regions. The exact utterance length does not matter much; it is more important to have a short silence region at the beginning and at the end of each utterance. A short silence region helps the trainer find context.
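
Splitting on silence can be sketched roughly as below. This is only an illustration, not SphinxTrain's own method: the function names, the RMS energy measure, the threshold, the 0.3 s minimum silence and the 5 s minimum utterance length are all assumptions picked for the example. Cutting in the middle of a silent run leaves a short silence at the end of one utterance and at the start of the next. The sketch does not enforce the 30 second upper bound; overly long segments would still need to be checked by hand.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame of `frame_len` samples."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames if f]


def choose_cut_points(rms, frame_s, threshold, min_silence_s=0.3, min_len_s=5.0):
    """Return cut times in seconds, placed in the middle of silent runs.

    A silent run is a sequence of frames whose RMS is below `threshold`.
    A cut is taken only if the run lasts at least `min_silence_s` and the
    utterance it closes would be at least `min_len_s` long. A silent run
    that reaches the end of the recording is never cut.
    All default values are illustrative assumptions.
    """
    cuts = []
    last_cut = 0.0
    run_start = None  # index of the first frame of the current silent run
    for i, energy in enumerate(rms):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            run_len_s = (i - run_start) * frame_s
            mid = (run_start + i) / 2 * frame_s
            if run_len_s >= min_silence_s and mid - last_cut >= min_len_s:
                cuts.append(mid)
                last_cut = mid
            run_start = None
    return cuts
```

For example, with 0.1 s frames, six seconds of speech followed by half a second of silence and more speech gives a single cut in the middle of the pause, while a pause only two seconds into the recording is skipped because the utterance would be too short.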

Nikolay Shmyrev
-
Does that mean, dear sir, that silence between words is not a problem, only the long silence at the beginning and at the end? Thank you so much for answering. – Allen Pol Sep 08 '15 at 00:31
-
Short silence in the beginning and in the end and no silence between words. – Nikolay Shmyrev Sep 08 '15 at 15:32
-
I did not write "long silence". – Nikolay Shmyrev Sep 08 '15 at 15:39
0
As a rule of thumb, the longer the segment, the better. To segment the audio, you might want to look at sox. Its trim effect is handy for cutting a recording into segments.
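
A minimal sketch of how sox trim could be driven from a script, assuming you already have a list of cut times in seconds. The helper names, the `utt_0000.wav` naming scheme and the file name in the usage note are hypothetical. The only sox behaviour relied on is `sox INPUT OUTPUT trim START LENGTH`, where the second number is a length, not an end position.

```python
def sox_trim_command(src, dst, start_s, end_s):
    """Build a sox command that copies the span [start_s, end_s) of src into dst."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("end_s must be greater than start_s")
    # sox trim takes a start position followed by a length, both in seconds here.
    return ["sox", src, dst, "trim", f"{start_s:.3f}", f"{duration:.3f}"]


def segment_commands(src, cuts, total_s, prefix="utt"):
    """One sox command per segment between consecutive cut times.

    `cuts` are cut times in seconds, `total_s` is the recording length.
    The output naming scheme is a hypothetical choice for this example.
    """
    bounds = [0.0] + list(cuts) + [total_s]
    return [
        sox_trim_command(src, f"{prefix}_{i:04d}.wav", a, b)
        for i, (a, b) in enumerate(zip(bounds, bounds[1:]))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` if sox is installed, for example for every command in `segment_commands("talk.wav", [6.25], 12.5)`.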

Mido
-
This is wrong: very long segments are not recommended for acoustic model training. – Nikolay Shmyrev Sep 03 '15 at 06:29