I'm trying to do simple voice-to-text mapping using pocketsphinx. The grammar is very simple, such as:
public <grammar> = (Matt | Anna | Tom | Christine)+ (One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero)+ ;
e.g.:
Tom Anna Three Three
yields
Tom Anna 33
I adapted the acoustic model (to take into account my foreign accent) and after that I received decent performance (~94% accuracy). I used a training dataset of ~3 minutes. Right now I'm trying to do the same but by whispering into the microphone. The accuracy dropped significantly, to ~50% without training; with training for my accent I got ~60%. I tried other things, including denoising and boosting the volume. I read the whole documentation, but I was wondering if anyone could answer some questions so I can better understand which direction to go in to improve performance.
1) In the tutorial you adapt the hub4wsj_sc_8k acoustic model. I guess "8k" is a sampling parameter. When using sphinx_fe you use "-samprate 16000". Was it deliberate to adapt an 8k model using data with a 16k sampling rate? Why wasn't data with an 8k sampling rate used? Does it influence performance?
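For context, the feature extraction step I'm referring to is roughly the tutorial's command (file names from my setup):

sphinx_fe -argfile hub4wsj_sc_8k/feat.params \
    -samprate 16000 -c arctic20.fileids \
    -di . -do . -ei wav -eo mfc -mswav yes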
2) In Sphinx 4.1 (in comparison to pocketsphinx) there are different acoustic models, e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar. Can those models be used with pocketsphinx? Will an acoustic model with 16k sampling typically perform better on data with a 16k sampling rate?
3) When recording data for training, should I speak in my normal voice (to adapt only for my accent) or whisper (to adapt both to whispering and to my accent)? I think I tried both scenarios and didn't notice enough difference to draw any conclusion, but I don't know the pocketsphinx internals, so I might be doing something wrong.
4) I used the following script to record the adaptation training and testing data from the tutorial:
for i in `seq 1 20`; do
    fn=`printf arctic_%04d $i`;
    read sent; echo $sent;
    rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;
done < arctic20.txt
I noticed that each time I hit Control-C the keypress is audible in the recorded audio, which led to errors. Trimming the audio sometimes corrected this and sometimes introduced other errors instead. Is there any requirement that each recording have a few seconds of quiet before and after the speech?
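What I mean by trimming is roughly the following (the 0.3 s cut and 0.5 s padding are values I guessed at, not anything from the docs):

# cut the final ~0.3 s where the Control-C click lands,
# then pad 0.5 s of silence onto both ends
sox arctic_0001.wav arctic_0001_clean.wav trim 0 -0.3 pad 0.5 0.5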
5) When accumulating observation counts, are there any settings I can tinker with to improve performance?
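For reference, by this step I mean the bw run, which in my case looks roughly like the tutorial's (file names from my setup):

bw \
    -hmmdir hub4wsj_sc_8k \
    -moddeffn hub4wsj_sc_8k/mdef.txt \
    -ts2cbfn .semi. \
    -feat 1s_c_d_dd \
    -svspec 0-12/13-25/26-38 \
    -cmn current \
    -agc none \
    -dictfn arctic20.dic \
    -ctlfn arctic20.fileids \
    -lsnfn arctic20.transcription \
    -accumdir .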
6) What's the difference between a semi-continuous and a continuous model? Can pocketsphinx use a continuous model?
7) I noticed that the 'mixture_weights' file from sphinx4 is much smaller compared to the one in pocketsphinx-extra. Does it make any difference?
8) I tried different combinations of removing white noise (using the 'sox' toolkit, e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending on the last parameter it sometimes improved things a little bit (~3%) and sometimes made them worse. Is it good to remove noise, or is that something pocketsphinx does as well? My environment is quiet; there is only white noise, which I guess can have more impact when the audio is recorded whispering.
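For completeness, the full two-step procedure was along these lines (the 0.5 s noise window is just a noise-only stretch at the start of my recordings):

# build a noise profile from a noise-only stretch, then subtract it
sox noisy.wav -n trim 0 0.5 noiseprof profile.nfo
sox noisy.wav filtered.wav noisered profile.nfo 0.1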
9) I noticed that boosting the volume (gain) alone most of the time only made the performance a little bit worse, even though for humans it was easier to distinguish the words. Should I avoid it?
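By boosting volume I mean a flat gain along these lines (the exact dB value varied between attempts):

# +10 dB flat boost; 'gain -n' would instead normalize the peak to 0 dBFS
sox whisper.wav louder.wav gain 10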
10) Overall I tried different combinations, and the best result I got was ~65% when only removing noise, so only a slight (~5%) improvement. Below are some stats:
//ORIGINAL UNPROCESSED TESTING FILES
TOTAL Words: 111 Correct: 72 Errors: 43
TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
//DENOISED + VOLUME UP
TOTAL Words: 111 Correct: 76 Errors: 42
TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
//VOLUME UP
TOTAL Words: 111 Correct: 69 Errors: 47
TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
//DENOISE, threshold 0.1
TOTAL Words: 111 Correct: 77 Errors: 41
TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
//DENOISE, threshold 0.21
TOTAL Words: 111 Correct: 80 Errors: 38
TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
This processing was applied only to the testing data. Should the training data be processed in the same way? I think I tried that, but there was barely any difference.
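By processing the training data I mean applying the same filter over the whole adaptation set, e.g.:

for f in arctic_*.wav; do
    sox $f denoised_$f noisered profile.nfo 0.21
done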
11) In all those tests I used an ARPA language model. When using JSGF the results were usually much worse (I have the latest pocketsphinx branch). Why is that?
12) Because in each sentence the maximum number would be '999' and there are no more than 3 names, I modified the JSGF and replaced the repetition sign '+' by repeating the content in the parentheses manually (see the sketch below). This time the results were much closer to ARPA. Is there any way in a grammar to specify the maximum number of repetitions, as in regular expressions?
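What I mean by manual expansion is roughly this (rule names are just for illustration):

public <command> = <name> [<name>] [<name>] <digit> [<digit>] [<digit>];
<name> = Matt | Anna | Tom | Christine;
<digit> = One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero;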
13) When using the ARPA model, I generated it using all possible combinations (since the dictionary is fixed and really small: ~15 words), but during testing I was still sometimes receiving illegal results, e.g. Tom Anna (without any required number). Is there any way to enforce some structure using an ARPA model?
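By "all possible combinations" I mean generating a sentence corpus along these lines and feeding it to the LM toolkit (shown here only for the one-name, one-digit case for brevity):

names="Matt Anna Tom Christine"
digits="One Two Three Four Five Six Seven Eight Nine Zero"
for n in $names; do
    for d in $digits; do
        echo "<s> $n $d </s>"
    done
done > corpus.txt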
14) Should the dictionary be limited only to those ~15 words, or will a full-size dictionary only affect speed, not accuracy?
15) Is modifying the dictionary (phonemes) the way to go to improve recognition when whispering? (I'm not an expert, but when we whisper I guess some words might sound different.)
16) Any other tips on how to improve accuracy would be really helpful!