
I've been trying to wrap my head around using Sphinx4 to make a still image animate when my girlfriend talks, for her Twitch.tv stream, something much like this general mittenz video: https://www.youtube.com/watch?v=L2oUE-C2g6Y. The talking cat is what I'm trying to emulate.

I get lost when it comes to bringing the image into the equation. I've been using this as an example:

```java
package edu.cmu.sphinx.demo.hellowrld;

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import models.Tts;

public class Speech {

    public static void main(String[] args) {
        ConfigurationManager cm;

        if (args.length > 0) {
            cm = new ConfigurationManager(args[0]);
        } else {
            // e.g. /tmp/helloworld.config.xml; by default load speech.config.xml from the classpath
            cm = new ConfigurationManager(Speech.class.getResource("speech.config.xml"));
        }

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.out.println("Cannot start microphone.");
            recognizer.deallocate();
            System.exit(1);
        }

        System.out.println("Say: (Hello | call) ( Naam | Baam | Caam | Some )");

        while (true) {
            System.out.println("Start speaking. Press Ctrl-C to quit.\n");

            // Blocks until a complete utterance has been recognized
            Result result = recognizer.recognize();

            if (result != null) {
                String resultText = result.getBestFinalResultNoFiller();
                System.out.println("You said: " + resultText + '\n');

                // Echo the recognized text back through text-to-speech
                Tts ts = new Tts();
                try {
                    ts.load();
                    ts.say("Did you say: " + resultText);
                } catch (IOException ex) {
                    Logger.getLogger(Speech.class.getName()).log(Level.SEVERE, null, ex);
                }
            } else {
                System.out.println("I can't hear what you said.\n");
            }
        }
    }
}
```

Any help would be appreciated.

1 Answer

Sphinx4 is not an appropriate tool for this task. It recognizes speech, not individual sounds in real time. What you need is a sound detector, in its simplest form an amplitude detector. The overall approach looks like this:

  1. Record a small piece of audio, say 100 ms.
  2. Calculate the amplitude of the speech (a simple sum of squares of the samples).
  3. Display the appropriate picture (mouth wide open for loud chunks, closed in silence); see the sketch after this list.
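
Here is a minimal sketch of that loop using plain `javax.sound.sampled` (no Sphinx4 involved). It assumes 16 kHz, 16-bit mono input; the `showImage` helper, the image file names and the threshold value are hypothetical placeholders that you would replace with your own display code and tune by experiment:

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class MouthAnimator {

    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono, signed, little-endian
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        int chunkBytes = (int) (format.getSampleRate() * 0.1) * 2; // ~100 ms of 16-bit samples
        byte[] buffer = new byte[chunkBytes];
        double threshold = 500; // placeholder; tune by experiment for your microphone

        while (true) {
            int read = line.read(buffer, 0, buffer.length);

            // Sum of squares over the chunk, then root-mean-square amplitude
            double sumSquares = 0;
            int samples = read / 2;
            for (int i = 0; i < samples; i++) {
                // Reassemble a little-endian 16-bit sample from two bytes
                int sample = (buffer[2 * i + 1] << 8) | (buffer[2 * i] & 0xff);
                sumSquares += (double) sample * sample;
            }
            double rms = Math.sqrt(sumSquares / Math.max(samples, 1));

            // Swap the displayed picture depending on loudness
            if (rms > threshold) {
                showImage("mouthOpen.png");
            } else {
                showImage("mouthClosed.png");
            }
        }
    }

    private static void showImage(String path) {
        // Placeholder: update your Swing/JavaFX view with the given image here
        System.out.println("Display " + path);
    }
}
```

In practice you would also smooth the RMS value over a few consecutive chunks so the mouth does not flicker on short bursts of noise.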

In a more advanced form you can recognize vowels and adjust the face picture accordingly; vowels can be recognized with a GMM classifier. You could even record multiple emotions and display them in real time. Real-time operation is the hard part, because the recognizer needs a very short analysis window, which makes the design of such a system complex; it would be a several-month project. You can find a more detailed description here.
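
To make the GMM idea a bit more concrete, here is a rough illustration of the scoring step: one diagonal-covariance GMM per vowel, with each frame assigned to the vowel whose model gives the highest likelihood. The vowel names, the 2-D "features" and the model parameters below are hypothetical placeholders; real models would be trained on recorded speech, and the features would come from an acoustic front end (e.g. MFCCs), not raw samples:

```java
import java.util.HashMap;
import java.util.Map;

public class VowelGmmClassifier {

    /** Parameters of one diagonal-covariance GMM (one per vowel). */
    static class Gmm {
        double[] weights;   // mixture weights, sum to 1
        double[][] means;   // [component][dimension]
        double[][] vars;    // [component][dimension]
        Gmm(double[] w, double[][] m, double[][] v) { weights = w; means = m; vars = v; }
    }

    /** Log-likelihood of a feature vector under a diagonal-covariance GMM. */
    static double logLikelihood(double[] x, Gmm gmm) {
        double total = 0;
        for (int k = 0; k < gmm.weights.length; k++) {
            double logp = Math.log(gmm.weights[k]);
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - gmm.means[k][d];
                logp += -0.5 * (Math.log(2 * Math.PI * gmm.vars[k][d]) + diff * diff / gmm.vars[k][d]);
            }
            total += Math.exp(logp); // a real implementation would use log-sum-exp here
        }
        return Math.log(total);
    }

    /** Pick the vowel whose GMM gives the highest likelihood for this frame. */
    static String classify(double[] features, Map<String, Gmm> models) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Gmm> e : models.entrySet()) {
            double score = logLikelihood(features, e.getValue());
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical single-component models in a 2-D feature space, just to show the flow
        Map<String, Gmm> models = new HashMap<>();
        models.put("a", new Gmm(new double[]{1.0}, new double[][]{{0.8, 0.2}}, new double[][]{{0.1, 0.1}}));
        models.put("o", new Gmm(new double[]{1.0}, new double[][]{{0.2, 0.7}}, new double[][]{{0.1, 0.1}}));

        double[] frameFeatures = {0.75, 0.25}; // would come from the audio front end in practice
        System.out.println("Closest vowel: " + classify(frameFeatures, models));
    }
}
```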

Nikolay Shmyrev