I am trying to use the CMU Sphinx speech recognizer to recognize speech files that I record in a WPF application.

Here is the sample code I compiled:

package com.example;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();

        configuration
                .setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration
                .setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration
                .setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(
                configuration);
        InputStream stream = new FileInputStream(new File("test.wav"));

        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}

And here is the code I used to produce the WAV file:

... in a WPF Window code behind ...

[DllImport("winmm.dll", EntryPoint = "mciSendStringA", CharSet = CharSet.Ansi, SetLastError = true, ExactSpelling = true)]
private static extern int mciSendString(string lpstrCommand, string lpstrReturnString, int uReturnLength, int hwndCallback);

private void OnRecButtonClicked(object sender, RoutedEventArgs e)
{
    mciSendString("close MediaFile ", "", 0, 0);
    mciSendString("open new Type waveaudio Alias recsound", "", 0, 0);
    mciSendString("set recsound channels 1", "", 0, 0);
    mciSendString("set recsound samplespersec 11025", "", 0, 0);
    mciSendString("set recsound alignment 4", "", 0, 0);
    mciSendString("set recsound bitspersample 16", "", 0, 0);
    mciSendString("record recsound", "", 0, 0);
    txtStatus.Text = "Recording...";
}

private void OnStopButtonClicked(object sender, RoutedEventArgs e)
{
    mciSendString("save recsound test.wav", "", 0, 0);
    mciSendString("close recsound ", "", 0, 0);
    txtStatus.Text = "Stopped...";
}

... more WPF ...

result.getHypothesis() always seems to return an empty string, no matter what I say. How do I start debugging what is wrong with the setup? Is something wrong with the encoding? With the speech quality? Or is the training insufficient? (The model I used came with the download.) I am not a native English speaker, so my pronunciation isn't standard, but I would still expect the recognizer to give some output.

Here is the sphinx4 log printed on the console:

C:\Users\mike\Downloads\Sample>java -cp .;sphinx4-core-5prealpha-SNAPSHOT.jar;sphinx4-data-5prealpha-SNAPSHOT.jar com.example.TranscriberDemo
00:21:36.660 INFO unitManager          CI Unit: *+NSN+
00:21:36.660 INFO unitManager          CI Unit: *+SPN+
00:21:36.676 INFO unitManager          CI Unit: AA
00:21:36.676 INFO unitManager          CI Unit: AE
00:21:36.676 INFO unitManager          CI Unit: AH
00:21:36.676 INFO unitManager          CI Unit: AO
00:21:36.676 INFO unitManager          CI Unit: AW
00:21:36.691 INFO unitManager          CI Unit: AY
00:21:36.691 INFO unitManager          CI Unit: B
00:21:36.691 INFO unitManager          CI Unit: CH
00:21:36.707 INFO unitManager          CI Unit: D
00:21:36.707 INFO unitManager          CI Unit: DH
00:21:36.707 INFO unitManager          CI Unit: EH
00:21:36.707 INFO unitManager          CI Unit: ER
00:21:36.707 INFO unitManager          CI Unit: EY
00:21:36.723 INFO unitManager          CI Unit: F
00:21:36.723 INFO unitManager          CI Unit: G
00:21:36.723 INFO unitManager          CI Unit: HH
00:21:36.723 INFO unitManager          CI Unit: IH
00:21:36.723 INFO unitManager          CI Unit: IY
00:21:36.738 INFO unitManager          CI Unit: JH
00:21:36.738 INFO unitManager          CI Unit: K
00:21:36.738 INFO unitManager          CI Unit: L
00:21:36.738 INFO unitManager          CI Unit: M
00:21:36.738 INFO unitManager          CI Unit: N
00:21:36.754 INFO unitManager          CI Unit: NG
00:21:36.754 INFO unitManager          CI Unit: OW
00:21:36.754 INFO unitManager          CI Unit: OY
00:21:36.754 INFO unitManager          CI Unit: P
00:21:36.754 INFO unitManager          CI Unit: R
00:21:36.754 INFO unitManager          CI Unit: S
00:21:36.769 INFO unitManager          CI Unit: SH
00:21:36.769 INFO unitManager          CI Unit: T
00:21:36.769 INFO unitManager          CI Unit: TH
00:21:36.769 INFO unitManager          CI Unit: UH
00:21:36.769 INFO unitManager          CI Unit: UW
00:21:36.785 INFO unitManager          CI Unit: V
00:21:36.785 INFO unitManager          CI Unit: W
00:21:36.785 INFO unitManager          CI Unit: Y
00:21:36.785 INFO unitManager          CI Unit: Z
00:21:36.785 INFO unitManager          CI Unit: ZH
00:21:37.568 INFO autoCepstrum         Cepstrum component auto-configured as follows: autoCepstrum {MelFrequencyFilterBank, Denoise, DiscreteCosineTransform2, Lifter}

00:21:37.584 INFO dictionary           Loading dictionary from: jar:file:/C:/Users/mike/Downloads/Sample/sphinx4-data-5prealpha-SNAPSHOT.jar!/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict
00:21:37.756 INFO dictionary           Loading filler dictionary from: jar:file:/C:/Users/mike/Downloads/Sample/sphinx4-data-5prealpha-SNAPSHOT.jar!/edu/cmu/sphinx/models/en-us/en-us/noisedict
00:21:37.756 INFO acousticModelLoader  Loading tied-state acoustic model from: jar:file:/C:/Users/mike/Downloads/Sample/sphinx4-data-5prealpha-SNAPSHOT.jar!/edu/cmu/sphinx/models/en-us/en-us
00:21:37.756 INFO acousticModelLoader  Pool means Entries: 16128
00:21:37.756 INFO acousticModelLoader  Pool variances Entries: 16128
00:21:37.756 INFO acousticModelLoader  Pool transition_matrices Entries: 42
00:21:37.756 INFO acousticModelLoader  Pool senones Entries: 5126
00:21:37.771 INFO acousticModelLoader  Gaussian weights: mixture_weights. Entries: 15378
00:21:37.771 INFO acousticModelLoader  Pool senones Entries: 5126
00:21:37.771 INFO acousticModelLoader  Context Independent Unit Entries: 42
00:21:37.771 INFO acousticModelLoader  HMM Manager: 137095 hmms
00:21:37.787 INFO acousticModel        CompositeSenoneSequences: 0
00:21:37.787 INFO trieNgramModel       Loading n-gram language model from: jar:file:/C:/Users/mike/Downloads/Sample/sphinx4-data-5prealpha-SNAPSHOT.jar!/edu/cmu/sphinx/models/en-us/en-us.lm.bin
00:21:41.227 INFO lexTreeLinguist      Max CI Units 43
00:21:41.227 INFO lexTreeLinguist      Unit table size 79507
00:21:41.227 INFO speedTracker         # ----------------------------- Timers----------------------------------------
00:21:41.227 INFO speedTracker         # Name               Count   CurTime   MinTime   MaxTime   AvgTime   TotTime
00:21:41.242 INFO speedTracker         Load AM              1       3.2610s   3.2610s   3.2610s   3.2610s   3.2610s
00:21:41.242 INFO speedTracker         Load LM              1       1.6110s   1.6110s   1.6110s   1.6110s   1.6110s
00:21:41.242 INFO speedTracker         Compile              1       1.8290s   1.8290s   1.8290s   1.8290s   1.8290s
00:21:41.242 INFO speedTracker         Load Dictionary      1       0.1720s   0.1720s   0.1720s   0.1720s   0.1720s
00:21:41.289 INFO speedTracker            This  Time Audio: 1.03s  Proc: 0.01s  Speed: 0.01 X real time
00:21:41.289 INFO speedTracker            Total Time Audio: 1.03s  Proc: 0.01s 0.01 X real time
00:21:41.289 INFO memoryTracker           Mem  Total: 619.00 Mb  Free: 362.60 Mb
00:21:41.289 INFO memoryTracker           Used: This: 256.40 Mb  Avg: 256.40 Mb  Max: 256.40 Mb
00:21:41.289 INFO trieNgramModel       LM Cache Size: 0 Hits: 0 Misses: 0
Hypothesis:
00:21:41.321 INFO trieNgramModel       LM Cache Size: 0 Hits: 0 Misses: 0
00:21:41.321 INFO speedTracker         # ----------------------------- Timers----------------------------------------
00:21:41.321 INFO speedTracker         # Name               Count   CurTime   MinTime   MaxTime   AvgTime   TotTime
00:21:41.321 INFO speedTracker         Load AM              1       3.2610s   3.2610s   3.2610s   3.2610s   3.2610s
00:21:41.321 INFO speedTracker         Score                4       0.0160s   0.0000s   0.0160s   0.0080s   0.0320s
00:21:41.321 INFO speedTracker         Prune                10      0.0000s   0.0000s   0.0000s   0.0000s   0.0000s
00:21:41.336 INFO speedTracker         Grow                 14      0.0000s   0.0000s   0.0150s   0.0011s   0.0150s
00:21:41.336 INFO speedTracker         Load LM              1       1.6110s   1.6110s   1.6110s   1.6110s   1.6110s
00:21:41.336 INFO speedTracker         Compile              1       1.8290s   1.8290s   1.8290s   1.8290s   1.8290s
00:21:41.336 INFO speedTracker         Frontend             4       0.0160s   0.0000s   0.0160s   0.0080s   0.0320s
00:21:41.352 INFO speedTracker         Load Dictionary      1       0.1720s   0.1720s   0.1720s   0.1720s   0.1720s
00:21:41.352 INFO speedTracker            Total Time Audio: 1.03s  Proc: 0.01s 0.01 X real time
00:21:41.352 INFO memoryTracker           Mem  Total: 619.00 Mb  Free: 362.60 Mb
00:21:41.352 INFO memoryTracker           Used: This: 256.40 Mb  Avg: 256.40 Mb  Max: 256.40 Mb

Thanks a lot in advance for helping!

Andrew Au
  • To start debugging, you need to provide the sphinx4 log; it is printed on the console. You also need to share the audio file. Most likely the WAV file has the wrong format: it should be 16 kHz, 16-bit, mono. If you have a different format, you will not get any result. – Nikolay Shmyrev Sep 07 '16 at 06:47
  • Thanks for your comment. I have provided the sphinx4 log. – Andrew Au Sep 07 '16 at 07:27
  • I have also tried to update the WPF recorder so that it produces a format as close as possible to the required one. It looks like the system only allows a few standard sampling rates, such as 11025 or 44100. It doesn't complain about other values, but the resulting audio file cannot be played by the media player. – Andrew Au Sep 07 '16 at 07:41
  • The audio file should play fine; otherwise you have not actually recorded anything. For sphinx4 you need code that is able to resample. Since you write in C, it's probably better for you to use pocketsphinx on Windows; you do not need Java then. In pocketsphinx you can accept an 11 kHz sample rate if you configure the input sample rate. – Nikolay Shmyrev Sep 07 '16 at 08:09
  • Finally some success: I converted the audio to the required 16-bit, 16 kHz, mono format using http://audio.online-convert.com/convert-to-wav and the code is finally recognizing something! – Andrew Au Sep 08 '16 at 13:27
  • It is better to use the 'sox' application; then there is no need for websites. The command line is `sox file.wav -r 16000 -c 1 converted.wav` – Nikolay Shmyrev Sep 08 '16 at 13:36
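
The format requirement from the comments above (16 kHz, 16-bit, mono) can be checked in Java before the file is handed to the recognizer. The following is only a minimal sketch using the standard `javax.sound.sampled` API. `WavFormatCheck`, `isSphinxCompatible`, and `check` are hypothetical names invented for this illustration and are not part of sphinx4, and the extra checks for signed little-endian PCM are an assumption about what a typical WAV file for the default en-us model looks like.

```java
import java.io.File;
import java.io.IOException;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

/** Hypothetical helper for illustration; not part of sphinx4. */
final class WavFormatCheck {

    private WavFormatCheck() {
    }

    /**
     * Returns true if the format is 16 kHz, 16-bit, mono.
     * The signed little-endian PCM checks are an assumption of this sketch.
     */
    static boolean isSphinxCompatible(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleRate() == 16000f
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 1
                && !format.isBigEndian();
    }

    /** Reads the header of an audio file, prints its format, and checks it. */
    static boolean check(File wavFile)
            throws IOException, UnsupportedAudioFileException {
        AudioFormat format = AudioSystem.getAudioFileFormat(wavFile).getFormat();
        System.out.println(wavFile + ": " + format);
        return isSphinxCompatible(format);
    }
}
```

Calling `WavFormatCheck.check(new File("test.wav"))` before `recognizer.startRecognition(stream)` would flag a recording made at 11025 Hz, such as the one produced by the WPF code in the question. The sketch only detects the mismatch; the file still has to be converted separately, for example with the `sox` command from the last comment.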
