I'm running PocketSphinx on Android (version 5prealpha). I'm using a file-defined keyword recognizer, specified by the following snippet (kwfile is the keyword definition file, and mRecognizer is an instance of SpeechRecognizer):

mRecognizer.addKeywordSearch(DESCRIPTOR, kwfile);

Overall, the recognition performance is pretty good, after having optimized the keyword thresholds. However, if I wait some arbitrary amount of time (5 sec up to several minutes) between one keyword utterance and the next, the recognition performance suffers on the second utterance. For example, I'll say "keyword," and it will be recognized. If I wait less than 5 sec and say "keyword" again, the second utterance will likely be recognized (recognition rate over 95%). If, however, I wait 15 sec, the recognition rate drops dramatically, to less than 50%.

My hypothesis is that when I say the keyword the second time, the recognizer is in the middle of a refresh: it is between a Stop Recognition event and a Start Recognition event, and my speech straddles that gap. Here is a typical excerpt from my logcat. Notice that after about 5 sec, the recognizer "refreshes". This usually happens about every 5 sec; occasionally it can be as long as 30 sec between "refreshes".

09-26 07:11:06.800  20397-20397/...﹕ Start recognition "kwfile"
09-26 07:11:06.815  20397-23642/...﹕ Starting decoding
09-26 07:11:11.310  20397-20397/...﹕ Stop recognition
09-26 07:11:11.315  20397-20397/...﹕ Start recognition "kwfile"
09-26 07:11:11.360  20397-23645/...﹕ Starting decoding
09-26 07:11:17.405  20397-20397/...﹕ Stop recognition

So, my question is: Is there anything I can do to control this "refresh rate"? Is this caused by something I'm doing wrong in my RecognitionListener implementation (see below, but note - I typically don't get any partial results between utterances.)? Or is there a PocketSphinx API call that I don't know about to set this refresh rate? Or, is there something I could change in the PocketSphinx source to improve this behavior?

class VoiceListener implements RecognitionListener{

        private boolean isCommand = false;

        @Override
        public void onBeginningOfSpeech() {
            Log.d(TAG,"Beginning of Speech");
            // do nothing
        }

        @Override
        public void onEndOfSpeech() {
            Log.d(TAG,"End of Speech");
            // do nothing
        }

        @Override
        public void onPartialResult(Hypothesis arg0) {
            if( arg0 != null){
                Log.d(TAG, "Partial results list: " + arg0.getHypstr());

                isCommand = false;

                // handle recognition results for keywords
                for( String command : this.getCurrentCommands() ) {
                    if (arg0.getHypstr().contains(command)) {
                        this.onRecognition(arg0.getHypstr());
                        isCommand = true;
                        mRecognizer.stop();
                        break; // one match is enough; avoids calling stop() repeatedly
                    }
                }

                // call stop, and let onResults() handle grammar results
                if( arg0.getHypstr().contains(Command.STOP_WORD))
                    mRecognizer.stop();

            }
        }

        @Override
        public void onResult(Hypothesis results) {

            String data;
            if( results == null ){
                data = null;
            }else{
                data = results.getHypstr();
            }

            Log.d(TAG,"Final results: " + data );

            // handle grammar recognition results
            if( !isCommand ){
                this.onRecognition(data);
            }
        }
}
Brad Kriel

1 Answer

There is no such thing as a "refresh rate". Recognition accuracy probably drops because there is background noise that is not being filtered out properly. You can study the raw audio dumps to check whether silence is being counted as speech, and you can share those dumps to get help with this issue.

Some things in your code are unnecessary. If you are using keyword spotting only, there is no need to stop and restart the recognizer in onEndOfSpeech as you are doing now; you can simply skip that. In spotting mode you do not need to wait for the end of speech to get a result: you can act on the partial result and then restart the recognizer.
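As a rough illustration of that flow, here is a minimal sketch that also covers the mixed keyword and grammar case. It is not the demo's code: `Recognizer` is a hypothetical stand-in for the few `SpeechRecognizer` calls used (`getSearchName()`, `stop()`, `startListening(String)`), the callbacks take the hypothesis text directly instead of a `Hypothesis`, and the search names `"kwfile"` and `"grammar"` plus the `recognized` list are assumptions made so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the few SpeechRecognizer calls this sketch needs,
// so the example compiles without the Android library.
interface Recognizer {
    String getSearchName();
    void stop();
    void startListening(String searchName);
}

class SearchAwareListener {
    static final String KWS_SEARCH = "kwfile";       // assumed keyword search name
    static final String GRAMMAR_SEARCH = "grammar";  // assumed grammar search name

    private final Recognizer recognizer;
    private final List<String> keywords;
    private boolean keywordHandled = false;
    // Collected recognitions; a real listener would invoke its actions instead.
    final List<String> recognized = new ArrayList<>();

    SearchAwareListener(Recognizer recognizer, List<String> keywords) {
        this.recognizer = recognizer;
        this.keywords = keywords;
    }

    private boolean inKeywordSearch() {
        return KWS_SEARCH.equals(recognizer.getSearchName());
    }

    // Keywords are handled here, without waiting for the end of speech.
    void onPartialResult(String hypstr) {
        if (hypstr == null || !inKeywordSearch()) {
            return; // grammar hypotheses are left for onResult()
        }
        for (String keyword : keywords) {
            if (hypstr.contains(keyword)) {
                recognized.add(keyword);
                keywordHandled = true;
                // Restart the keyword search so the next utterance starts fresh.
                recognizer.stop();
                recognizer.startListening(KWS_SEARCH);
                return;
            }
        }
    }

    // Only a grammar search is stopped at the end of speech.
    void onEndOfSpeech() {
        if (!inKeywordSearch()) {
            recognizer.stop();
        }
    }

    // Final results are used for grammar recognition only.
    void onResult(String hypstr) {
        if (keywordHandled) {
            keywordHandled = false; // this result belongs to a keyword already handled
            return;
        }
        if (hypstr != null) {
            recognized.add(hypstr);
        }
    }
}
```

The `keywordHandled` flag plays the same role as `isCommand` in the question: it keeps the final result that follows a `stop()` from being reported a second time.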

Nikolay Shmyrev
  • Thanks, Nikolay. I'm actually doing both keyword and grammar recognition using the same RecognitionListener. So, I could either take out the stop() and see if that improves things, or I could break it up into two SpeechRecognizer instances, one for keywords and the other for grammars, and then have two separate Listeners, one with the stop and the other without it. Anyway, I'll fiddle with the options and let you know how it works out. If I'm still having trouble, I'll add some audio logs. – Brad Kriel Sep 28 '15 at 17:52
  • It is not a good idea to use two instances. Use a single instance and switch between searches, but also track the current search and avoid stopping the recognizer when the current search is the keyword search. The Android demo does this properly; you can just follow it. – Nikolay Shmyrev Sep 28 '15 at 19:18
  • OK. I got it. I'll modify my code block with your answer as well. I took everything out of onEndOfSpeech(), and handle keywords right in onPartialResult() and grammars in onResult(). An added bonus is that now keyword recognition happens measurably faster because it doesn't have to wait for the recognizer to call onResult(). Thanks! – Brad Kriel Sep 29 '15 at 23:00