Recognizing live speech with the Sphinx4 Java API

Question

I am trying to run the tutorial program for live speech recognition using Sphinx4. This is the main class:

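import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.result.WordResult;

public class LiveRecognition {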

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");
        configuration.setUseGrammar(false);

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);

        recognizer.startRecognition(true);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            for(WordResult word : result.getWords()) {
                System.out.println(word);
            }
        }
        recognizer.stopRecognition();
    }
}

So far I am using the dictionary and acoustic model provided by Sphinx. When I run the program, it keeps producing random text, almost as if it were talking to itself, and nothing it prints comes close to what I say into the microphone. For example, the output looks like this:

....
{between, 1.000, [2700:3610]}
23:21:37.391 INFO speedTracker            This  Time Audio: 0.83s  Proc: 3.82s  Speed: 4.60 X real time
23:21:37.391 INFO speedTracker            Total Time Audio: 1.58s  Proc: 7.66s 4.85 X real time
23:21:37.391 INFO memoryTracker           Mem  Total: 1173.00 Mb  Free: 410.17 Mb
23:21:37.393 INFO memoryTracker           Used: This: 762.83 Mb  Avg: 507.82 Mb  Max: 762.83 Mb
23:21:37.393 INFO trieNgramModel       LM Cache Size: 4183 Hits: 990660 Misses: 4183
{<sil>, 1.000, [3610:5810]}
{what, 1.000, [5820:6380]}
23:21:41.615 INFO speedTracker            This  Time Audio: 0.55s  Proc: 2.21s  Speed: 4.01 X real time
23:21:41.615 INFO speedTracker            Total Time Audio: 2.13s  Proc: 9.87s 4.63 X real time
23:21:41.615 INFO memoryTracker           Mem  Total: 1316.50 Mb  Free: 540.36 Mb
23:21:41.615 INFO memoryTracker           Used: This: 776.14 Mb  Avg: 597.26 Mb  Max: 776.14 Mb
23:21:41.615 INFO trieNgramModel       LM Cache Size: 5332 Hits: 1263784 Misses: 5332
{<sil>, 1.000, [6380:9060]}
{ooh, 1.000, [9070:9280]}
....

What am I doing wrong? I want to see "hello world" when I say "hello world". Both words are present in the dictionary.

[UPDATE] I made a small language model file and a corresponding dictionary from a small corpus file, using this online service as described here. This time it worked with better accuracy, still using the default acoustic model provided with the sphinx-data library. I don't need to train the acoustic model, since I will mostly be dealing with US English. But I want a good language model and dictionary for general-purpose short sentences. The language model that comes with Sphinx is not working well for me.

[UPDATE] Since Nikolay Shmyrev mentioned below that it could be due to poor computing performance, this is what I use:

Intel® Core™ i7-4790 CPU @ 3.60GHz
16 GB DDR3 RAM
Windows 10 and Ubuntu 14.04

Processing power can be increased if needed.

score 0 · Answer 1 · answered Mar 22 '16 at 19:24

Your computer is too slow: it cannot process audio in real time, so the results are inaccurate. For slow computers, use pocketsphinx instead.

Pocketsphinx has a Java/JNI API too. You can find an example here; it should look like this:

    Config c = Decoder.defaultConfig();
    c.setString("-hmm", "../../model/en-us/en-us");
    c.setString("-lm", "../../model/en-us/en-us.lm.bin");
    c.setString("-dict", "../../model/en-us/cmudict-en-us.dict");
    Decoder d = new Decoder(c);

    FileInputStream ais = new FileInputStream(new File("../../test/data/goforward.raw"));

    d.startUtt();
    d.setRawdataSize(300000);
    byte[] b = new byte[4096];
    int nbytes;
    while ((nbytes = ais.read(b)) >= 0) {
        ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        short[] s = new short[nbytes/2];
        bb.asShortBuffer().get(s);
        d.processRaw(s, nbytes/2, false, false);
    }
    d.endUtt();
    System.out.println(d.hyp().getHypstr());

    short[] data = d.getRawdata();
    System.out.println("Data size: " + data.length);
    DataOutputStream dos = new DataOutputStream(new FileOutputStream(new File("/tmp/test.raw")));
    for (int i = 0; i < data.length; i++) {
        dos.writeShort(data[i]);
    }
    dos.close();

    for (Segment seg : d.seg()) {
        System.out.println(seg.getWord());
    }
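
The example above decodes a raw file. For live input, the same loop could in principle be fed from a microphone instead. The following is a minimal sketch that uses only the standard `javax.sound.sampled` API. The class name `MicrophoneFeed`, the `SampleSink` interface, and the 16 kHz, 16-bit, mono format are assumptions made for illustration and are not part of pocketsphinx; the sink is the place where a call such as `d.processRaw(samples, count, false, false)` from the example above would go.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

public class MicrophoneFeed {

    /** Hypothetical callback that stands in for the decoder. */
    public interface SampleSink {
        void accept(short[] samples, int count);
    }

    /** Converts little-endian 16-bit PCM bytes to samples, as the file loop above does. */
    public static short[] toSamples(byte[] bytes, int nbytes) {
        short[] samples = new short[nbytes / 2];
        ByteBuffer.wrap(bytes, 0, nbytes)
                  .order(ByteOrder.LITTLE_ENDIAN)
                  .asShortBuffer()
                  .get(samples);
        return samples;
    }

    /** Reads roughly the given number of seconds from the default microphone. */
    public static void capture(double seconds, SampleSink sink) throws LineUnavailableException {
        // Assumed format: 16 kHz, 16-bit, mono, signed, little-endian.
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();
        try {
            byte[] buffer = new byte[4096];
            // Count whole frames (2 bytes each) so every read is frame-aligned.
            long remaining = (long) (seconds * 16000) * 2;
            while (remaining > 0) {
                int n = line.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n <= 0) {
                    break;
                }
                short[] samples = toSamples(buffer, n);
                sink.accept(samples, samples.length);
                remaining -= n;
            }
        } finally {
            line.stop();
            line.close();
        }
    }

    public static void main(String[] args) {
        long[] total = {0};
        try {
            // With a decoder, the lambda body would pass the samples on instead of counting them.
            capture(2.0, (samples, count) -> total[0] += count);
            System.out.println("Captured samples: " + total[0]);
        } catch (LineUnavailableException | IllegalArgumentException e) {
            System.out.println("No usable microphone: " + e.getMessage());
        }
    }
}
```

Keeping the conversion in its own method makes it easy to check without audio hardware, and the callback keeps the capture loop independent of whichever decoder is used.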

I don't think mine is slow. My machine has a lot of RAM and raw processing power. However, I made some lm and dic files (language model and dictionary) using the online service mentioned in the tutorial, and then it worked with much better accuracy. Thank you very much for the help; I will look into pocketsphinx and let you know whether your solution helped. Thanks. — Zobayer Hasan, Mar 29 '16 at 09:21
The log says 4.63 X real time, which means it is slow. You probably want to give it more memory then. — Nikolay Shmyrev, Mar 29 '16 at 11:21
I have a similar issue. Did you get it to work on that powerful machine? — Betafish, Mar 05 '18 at 07:36
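
The "X real time" figure discussed in the comments is the processing time divided by the audio duration, so any value above 1 means the recognizer is falling behind the microphone. As a small illustration, the sketch below recomputes that ratio from a speedTracker line copied from the question's log. The class name `RealTimeFactor` and the regular expression are assumptions made for this example and are not part of Sphinx4.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RealTimeFactor {

    // Matches the "Audio: 2.13s  Proc: 9.87s" part of a speedTracker log line.
    private static final Pattern TIMES =
            Pattern.compile("Audio:\\s*([0-9.]+)s\\s+Proc:\\s*([0-9.]+)s");

    /** Returns processing time divided by audio time, or NaN if the line has no timings. */
    public static double fromLogLine(String line) {
        Matcher m = TIMES.matcher(line);
        if (!m.find()) {
            return Double.NaN;
        }
        double audioSeconds = Double.parseDouble(m.group(1));
        double procSeconds = Double.parseDouble(m.group(2));
        return procSeconds / audioSeconds;
    }

    public static void main(String[] args) {
        String line = "23:21:41.615 INFO speedTracker            "
                + "Total Time Audio: 2.13s  Proc: 9.87s 4.63 X real time";
        double rtf = fromLogLine(line);
        // 9.87 / 2.13 is about 4.63, matching the figure in the log.
        System.out.println(String.format(Locale.ROOT, "%.2f X real time", rtf));
        System.out.println(rtf > 1.0 ? "slower than real time" : "keeps up with real time");
    }
}
```

For the line shown, 9.87 s of processing for 2.13 s of audio gives a factor of about 4.63, which is why the log is read as "too slow" regardless of how much RAM the machine has.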
