I'm learning about Mozilla's DeepSpeech Speech-To-Text engine. I had no trouble getting the command line interface working, but the Python interface seems to be behaving differently. When I run:
deepspeech --model models/output_graph.pb --alphabet models/alphabet.txt --audio testFile3.wav
On a PCM, 16 bit, mono 48000 Hz .wav file generated with sox, I get the following:
test test apple benana
Apart from "benana" where I said "banana", this looks right, and the other files I've tested come out similarly well. The problem comes when I try to use the following code, which comes from this tutorial:
import deepspeech
import scipy.io.wavfile as wav
import sys
ds=deepspeech.Model(sys.argv[1],26,9,sys.argv[2],500)
fs,audio=wav.read(sys.argv[3])
processed_data=ds.stt(audio,fs)
print(processed_data)
I run the code with the following command:
python3 -Bi test.py models/output_graph.pb models/alphabet.txt testFile3.wav
Depending on the specific file, I get different four-character responses. The response I got from this particular file was 'hahm', but 'hmhm' and ' eo' are also common. Changing the parameters to the model (the 26, 9, and 500) doesn't seem to change the output.
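For reference, this is how I would double-check what the Python side actually sees in the file header, since the command line client and my script might not be handling the 48000 Hz input the same way (that is a guess on my part, not something I have confirmed). It is a minimal sketch using only the standard library wave module, and describe_wav is a name I made up for illustration:

```python
import wave


def describe_wav(path):
    """Return the basic header fields of a PCM WAV file as a dict."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),            # 1 = mono, 2 = stereo
            "sample_width_bytes": wf.getsampwidth(),  # 2 bytes = 16 bit
            "sample_rate_hz": wf.getframerate(),      # e.g. 48000
            "frames": wf.getnframes(),                # total samples per channel
        }


# Hypothetical usage with the file from this question:
# print(describe_wav("testFile3.wav"))
```

If the sample rate reported here differs from what the model expects, that would at least narrow down where the two interfaces diverge.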