I'm trying to record audio at a client, send it to a "server," and then use speech-to-text at the "server" with Sphinx4. My code:
public class SoundModifier implements Runnable
{
private static final String ACOUSTIC_MODEL = "resource:/edu/cmu/sphinx/models/en-us/en-us";
private static final String DICTIONARY_PATH = "resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict";
private static final String GRAMMAR_PATH = "resource:/edu/cmu/sphinx/demo/dialog/";
private static final String LANGUAGE_MODEL = "resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin";
// other unrelated stuff
public SoundModifier(ConcurrentLinkedQueue inputQueue, ConcurrentLinkedQueue outputQueue, String saveFolder) throws IOException
{
Configuration configuration = new Configuration();
configuration.setAcousticModelPath(ACOUSTIC_MODEL);
configuration.setDictionaryPath(DICTIONARY_PATH);
configuration.setLanguageModelPath(LANGUAGE_MODEL);
configuration.setSampleRate(16000);
recognizer = new StreamSpeechRecognizer(configuration);
// other unrelated stuff
}
@Override
public void run()
{
var now = ZonedDateTime.now();
while(running)
{
while (inputQueue.size() > 0)
{
byte[] chunk = (byte[]) inputQueue.poll();
byte[] copy = Arrays.copyOf(chunk, chunk.length);
try
{
getText(copy);
}
catch (IOException ex)
{
Logger.getLogger(SoundModifier.class.getName()).log(Level.SEVERE, null, ex);
}
recordBytes.write(copy, 0, copy.length);
byte[][] send = new byte[][]{"audio".getBytes(), copy };
outputQueue.add(send);
}
}
String time = now.getYear() + "-" + now.getMonthValue() + "-" + now.getDayOfMonth() + "--" + now.getHour() + "-" + now.getMinute() + "-" + now.getSecond();
String filename = saveFolder + time + " SoundModifier.wav";
File file = new File(filename);
try
{
save(file);
}
catch (IOException ex)
{
Logger.getLogger(SoundRecorder.class.getName()).log(Level.WARNING, null, ex);
}
}
private ArrayList<WordResult> getText(byte[] input) throws IOException
{
ArrayList<WordResult> utteredWords = new ArrayList<>();
stream = new ByteArrayInputStream(input);
recognizer.startRecognition(stream);
SpeechResult result;
while ((result = recognizer.getResult()) != null)
{
// var words = result.getWords();
// System.out.println("words: " + words);
// utteredWords.addAll(words);
System.out.format("Hypothesis: %s\n", result.getHypothesis());
serverFrame.setASRText(result.getHypothesis());
}
recognizer.stopRecognition();
return utteredWords;
}
public void save(File wavFile) throws IOException
{
byte[] audioData = recordBytes.toByteArray();
ByteArrayInputStream bais = new ByteArrayInputStream(audioData);
try (AudioInputStream audioInputStream = new AudioInputStream(bais, format, audioData.length / format.getFrameSize()))
{
AudioSystem.write(audioInputStream, AudioFileFormat.Type.WAVE, wavFile);
}
recordBytes.close();
LOGGER.log(Level.INFO, "recordBytes close");
}
}
This produces the following output:
11:23:33.703 INFO trieNgramModel LM Cache Size: 0 Hits: 0 Misses: 0
11:23:33.703 INFO speedTracker # ----------------------------- Timers----------------------------------------
11:23:33.703 INFO speedTracker # Name Count CurTime MinTime MaxTime AvgTime TotTime
11:23:33.703 INFO speedTracker Load Dictionary 46 0.0350s 0.0340s 0.0740s 0.0415s 1.9100s
11:23:33.703 INFO speedTracker Load AM 1 0.8700s 0.8700s 0.8700s 0.8700s 0.8700s
11:23:33.703 INFO speedTracker Frontend 184 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:33.703 INFO speedTracker Load LM 46 0.2640s 0.2320s 0.3450s 0.2699s 12.4150s
11:23:33.703 INFO speedTracker Score 184 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:33.703 INFO speedTracker Prune 460 0.0000s 0.0000s 0.0000s 0.0000s 0.0000s
11:23:33.703 INFO speedTracker Grow 644 0.0000s 0.0000s 0.0030s 0.0001s 0.0380s
11:23:33.703 INFO speedTracker Compile 46 0.3450s 0.2990s 0.6200s 0.3422s 15.7400s
11:23:33.703 INFO speedTracker Total Time Audio: 5.89s Proc: 0.03s 0.00 X real time
11:23:33.703 INFO memoryTracker Mem Total: 1186.00 Mb Free: 689.00 Mb
11:23:33.703 INFO memoryTracker Used: This: 497.00 Mb Avg: 657.31 Mb Max: 1468.03 Mb
11:23:33.703 INFO dictionary Loading dictionary from: jar:file:/C:/Users/???/.m2/repository/de/sciss/sphinx4-data/1.0.0/sphinx4-data-1.0.0.jar!/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict
11:23:33.743 INFO dictionary Loading filler dictionary from: jar:file:/C:/Users/???/.m2/repository/de/sciss/sphinx4-data/1.0.0/sphinx4-data-1.0.0.jar!/edu/cmu/sphinx/models/en-us/en-us/noisedict
11:23:33.743 INFO trieNgramModel Loading n-gram language model from: jar:file:/C:/Users/???/.m2/repository/de/sciss/sphinx4-data/1.0.0/sphinx4-data-1.0.0.jar!/edu/cmu/sphinx/models/en-us/en-us.lm.bin
11:23:33.902 INFO dictionary The dictionary is missing a phonetic transcription for the word '3-d'
11:23:33.903 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word '3-d'
11:23:33.903 INFO dictionary The dictionary is missing a phonetic transcription for the word 'adjustors'
11:23:33.904 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'adjustors'
11:23:33.904 INFO dictionary The dictionary is missing a phonetic transcription for the word 'adulyadej'
11:23:33.904 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'adulyadej'
11:23:33.915 INFO dictionary The dictionary is missing a phonetic transcription for the word 'chloroflourocarbons'
11:23:33.915 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'chloroflourocarbons'
11:23:33.925 INFO dictionary The dictionary is missing a phonetic transcription for the word 'déjà'
11:23:33.925 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'déjà'
11:23:33.940 INFO dictionary The dictionary is missing a phonetic transcription for the word 'iife'
11:23:33.940 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'iife'
11:23:33.952 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mm-hm'
11:23:33.952 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'mm-hm'
11:23:33.952 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mm-hmm'
11:23:33.952 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'mm-hmm'
11:23:33.952 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mmmm'
11:23:33.952 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'mmmm'
11:23:33.954 INFO dictionary The dictionary is missing a phonetic transcription for the word 'ngo's'
11:23:33.954 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'ngo's'
11:23:33.956 INFO dictionary The dictionary is missing a phonetic transcription for the word 'occured'
11:23:33.956 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'occured'
11:23:33.956 INFO dictionary The dictionary is missing a phonetic transcription for the word 'offical'
11:23:33.956 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'offical'
11:23:33.956 INFO dictionary The dictionary is missing a phonetic transcription for the word 'officals'
11:23:33.956 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'officals'
11:23:33.963 INFO dictionary The dictionary is missing a phonetic transcription for the word 'port_au_prince'
11:23:33.963 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'port_au_prince'
11:23:33.963 INFO dictionary The dictionary is missing a phonetic transcription for the word 'possiblity'
11:23:33.963 WARNING trieNgramModel The dictionary is missing a phonetic transcription for the word 'possiblity'
11:23:33.987 WARNING trieNgramModel Dictionary is missing 15 words that are contained in the language model.
11:23:34.080 INFO dictionary The dictionary is missing a phonetic transcription for the word 'offical'
11:23:34.080 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mm-hm'
11:23:34.081 INFO dictionary The dictionary is missing a phonetic transcription for the word 'adulyadej'
11:23:34.081 INFO dictionary The dictionary is missing a phonetic transcription for the word 'adjustors'
11:23:34.082 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mm-hmm'
11:23:34.082 INFO dictionary The dictionary is missing a phonetic transcription for the word 'ngo's'
11:23:34.082 INFO dictionary The dictionary is missing a phonetic transcription for the word 'officals'
11:23:34.083 INFO dictionary The dictionary is missing a phonetic transcription for the word 'chloroflourocarbons'
11:23:34.083 INFO dictionary The dictionary is missing a phonetic transcription for the word '3-d'
11:23:34.084 INFO dictionary The dictionary is missing a phonetic transcription for the word 'déjà'
11:23:34.085 INFO dictionary The dictionary is missing a phonetic transcription for the word 'port_au_prince'
11:23:34.086 INFO dictionary The dictionary is missing a phonetic transcription for the word 'mmmm'
11:23:34.086 INFO dictionary The dictionary is missing a phonetic transcription for the word 'iife'
11:23:34.089 INFO dictionary The dictionary is missing a phonetic transcription for the word 'possiblity'
11:23:34.090 INFO dictionary The dictionary is missing a phonetic transcription for the word 'occured'
11:23:34.281 INFO lexTreeLinguist Max CI Units 43
11:23:34.281 INFO lexTreeLinguist Unit table size 79507
11:23:34.281 INFO speedTracker # ----------------------------- Timers----------------------------------------
11:23:34.281 INFO speedTracker # Name Count CurTime MinTime MaxTime AvgTime TotTime
11:23:34.281 INFO speedTracker Load Dictionary 47 0.0400s 0.0340s 0.0740s 0.0415s 1.9500s
11:23:34.281 INFO speedTracker Load AM 1 0.8700s 0.8700s 0.8700s 0.8700s 0.8700s
11:23:34.281 INFO speedTracker Frontend 184 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:34.281 INFO speedTracker Load LM 47 0.2440s 0.2320s 0.3450s 0.2693s 12.6590s
11:23:34.281 INFO speedTracker Score 184 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:34.281 INFO speedTracker Prune 460 0.0000s 0.0000s 0.0000s 0.0000s 0.0000s
11:23:34.281 INFO speedTracker Grow 644 0.0000s 0.0000s 0.0030s 0.0001s 0.0380s
11:23:34.281 INFO speedTracker Compile 47 0.2940s 0.2940s 0.6200s 0.3411s 16.0340s
11:23:34.282 INFO speedTracker This Time Audio: 0.13s Proc: 0.00s Speed: 0.00 X real time
11:23:34.282 INFO speedTracker Total Time Audio: 6.02s Proc: 0.03s 0.00 X real time
11:23:34.282 INFO memoryTracker Mem Total: 1186.00 Mb Free: 301.00 Mb
11:23:34.282 INFO memoryTracker Used: This: 885.00 Mb Avg: 659.76 Mb Max: 1468.03 Mb
11:23:34.282 INFO trieNgramModel LM Cache Size: 0 Hits: 0 Misses: 0
Hypothesis:
11:23:34.282 INFO trieNgramModel LM Cache Size: 0 Hits: 0 Misses: 0
11:23:34.282 INFO speedTracker # ----------------------------- Timers----------------------------------------
11:23:34.282 INFO speedTracker # Name Count CurTime MinTime MaxTime AvgTime TotTime
11:23:34.282 INFO speedTracker Load Dictionary 47 0.0400s 0.0340s 0.0740s 0.0415s 1.9500s
11:23:34.282 INFO speedTracker Load AM 1 0.8700s 0.8700s 0.8700s 0.8700s 0.8700s
11:23:34.282 INFO speedTracker Frontend 188 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:34.282 INFO speedTracker Load LM 47 0.2440s 0.2320s 0.3450s 0.2693s 12.6590s
11:23:34.282 INFO speedTracker Score 188 0.0000s 0.0000s 0.0030s 0.0000s 0.0090s
11:23:34.282 INFO speedTracker Prune 470 0.0000s 0.0000s 0.0000s 0.0000s 0.0000s
11:23:34.282 INFO speedTracker Grow 658 0.0000s 0.0000s 0.0030s 0.0001s 0.0380s
11:23:34.282 INFO speedTracker Compile 47 0.2940s 0.2940s 0.6200s 0.3411s 16.0340s
11:23:34.282 INFO speedTracker Total Time Audio: 6.02s Proc: 0.03s 0.00 X real time
11:23:34.282 INFO memoryTracker Mem Total: 1186.00 Mb Free: 301.00 Mb
11:23:34.282 INFO memoryTracker Used: This: 885.00 Mb Avg: 662.16 Mb Max: 1468.03 Mb
This type of output repeats while I'm recording audio with the client (and goes longer than the recording time actually, even though it looks like the processing time is 0.03 seconds).
Audio format is defined elsewhere:
public class StaticAudioFormat
{
private static final int channels = 1;
private static final boolean signed = true;
private static final boolean bigEndian = false;
private static final float sampleRate = 16000;
private static final int sampleSizeInBits = 16;
/**
* Defines a default audio format used to record
*/
static AudioFormat getAudioFormat()
{
return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
}
}
I can read the saved audio into Audacity after-the-fact and it sounds fine. I can transcribe the recorded file with:
recognizer.startRecognition(new FileInputStream("???/2020-5-12--13-9-37 SoundModifier.wav"));
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();
System.out.println("---------------------------------------------------------------");
while ((result = recognizer.getResult()) != null) {
System.out.println(result.getHypothesis());
}
System.out.println("---------------------------------------------------------------");
...and it correctly outputs what I said.
What do I need to do to get Sphinx4's StreamSpeechRecognizer to output text from speech in real time?
Edit: I'm on Windows, which may preclude some options.