Why do you need to segment the audio into 5-30 second pieces to build the acoustic model?

Question

Sphinx4 requires the audio used to train the acoustic model to be segmented into pieces of 5-30 seconds each. Why? And how do you segment the audio? When should you cut at 5 seconds, at 10 seconds, or at 25 seconds? Thank you, dear sir!

score 1 · Accepted Answer · answered Sep 03 '15 at 06:29

Sphinxtrain aligns the transcript text to the audio during training: it tries to match each phoneme with the piece of audio that contains it. When the audio is long, a good match is harder to find because there are too many possible alignments and more opportunities for mistakes. For that reason it is better to keep to the recommended utterance length.

When you segment the audio, split on silence regions. The exact utterance length does not matter much; it is more important to have a short silence region at the beginning and at the end of each utterance. A short silence region helps the trainer find context.
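As an illustration of splitting on silence, here is a minimal sketch in Python. It is not part of Sphinxtrain; the function name `split_on_silence`, the mean-amplitude test, and all parameter defaults (`threshold`, `min_silence_ms`, `pad_ms`) are assumptions chosen for the example and would need tuning for real recordings. It takes samples normalized to the range -1..1 and returns sample ranges that each keep a short silence pad at both ends.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    rate: int,
    frame_ms: int = 20,         # analysis frame length (assumed value)
    threshold: float = 0.01,    # mean |amplitude| above this counts as speech (assumed)
    min_silence_ms: int = 300,  # a pause at least this long ends an utterance (assumed)
    pad_ms: int = 100,          # short silence kept at each end (assumed)
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of utterances separated by silence."""
    frame = max(1, rate * frame_ms // 1000)

    # Classify each frame as speech or silence by its mean absolute amplitude.
    voiced = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        voiced.append(sum(abs(x) for x in chunk) / len(chunk) > threshold)

    # Group speech frames into utterances; a long enough silent run closes one.
    min_gap = max(1, min_silence_ms // frame_ms)
    segments = []
    start = None
    last_voiced = 0
    silence_run = 0
    for idx, is_speech in enumerate(voiced):
        if is_speech:
            if start is None:
                start = idx
            last_voiced = idx
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_gap:
                segments.append((start, last_voiced))
                start = None
    if start is not None:
        segments.append((start, last_voiced))

    # Convert frame indices to sample indices and add the silence padding.
    pad = pad_ms * rate // 1000
    return [
        (max(0, s * frame - pad), min(len(samples), (e + 1) * frame + pad))
        for s, e in segments
    ]
```

With the defaults, the padding on each side is shorter than half of the minimum pause, so neighbouring segments do not overlap. Pauses shorter than `min_silence_ms` stay inside one utterance.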

Nikolay Shmyrev


Does that mean, dear sir, that silence between words is not a problem, only long silence at the beginning and at the end? Thank you so much for answering. – Allen Pol Sep 08 '15 at 00:31
Short silence at the beginning and at the end, and no silence between words. – Nikolay Shmyrev Sep 08 '15 at 15:32
I did not write "long silence". – Nikolay Shmyrev Sep 08 '15 at 15:39

score 0 · Answer 2 · answered Sep 02 '15 at 23:39

As a rule of thumb, the longer the segment, the better. To segment the audio, you might look at sox. It has a trim effect that is handy for segmentation.
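For the sox part, a minimal sketch of how one cut could be expressed. The helper name `sox_trim_command` and the file names are hypothetical; the only sox feature relied on is the `trim` effect taking a start position and a length in seconds. The helper only builds the argument list and does not run sox.

```python
from typing import List


def sox_trim_command(src: str, dst: str, start_s: float, length_s: float) -> List[str]:
    """Build a sox call that copies length_s seconds of src, starting at start_s, into dst."""
    # Equivalent shell form: sox SRC DST trim START LENGTH
    return ["sox", src, dst, "trim", f"{start_s:g}", f"{length_s:g}"]


# Hypothetical file names: cut 10 seconds starting at 12.5 seconds.
cmd = sox_trim_command("recording.wav", "utt_0001.wav", 12.5, 10)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where sox is installed.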

Mido


This is wrong: very long segments are not recommended for acoustic model training. – Nikolay Shmyrev Sep 03 '15 at 06:29
