I'm learning about Mozilla's DeepSpeech Speech-To-Text engine. I had no trouble getting the command line interface working, but the Python interface seems to be behaving differently. When I run:

deepspeech --model models/output_graph.pb --alphabet models/alphabet.txt --audio testFile3.wav

on a 16-bit PCM, mono, 48000 Hz .wav file generated with sox, I get the following:

test test apple benana

Apart from "benana" where I meant "banana", it seems to work fine, as it does on the other files I've tested. The problem comes when I try to use the following code, which comes from this tutorial:

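import sys

import deepspeech
import scipy.io.wavfile as wav

# Arguments: model path, alphabet path, audio path.
# 26 = number of features, 9 = context size, 500 = beam width.
ds = deepspeech.Model(sys.argv[1], 26, 9, sys.argv[2], 500)
fs, audio = wav.read(sys.argv[3])
processed_data = ds.stt(audio, fs)

print(processed_data)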

I run the code with the following command:

python3 -Bi test.py models/output_graph.pb models/alphabet.txt testFile3.wav

Depending on the specific file, I get different four-character responses. The response I got from this particular file was 'hahm', but 'hmhm' and ' eo' are also common. Changing the parameters to the model (the 26, 9, and 500) doesn't seem to change the output.

Deduplicator
Display name

2 Answers

Just include your trie and lm.binary files and try again.

from deepspeech import Model
import scipy.io.wavfile

BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26
N_CONTEXT = 9
MODEL_FILE = 'output_graph.pbmm'
ALPHABET_FILE = 'alphabet.txt'
LANGUAGE_MODEL =  'lm.binary'
TRIE_FILE =  'trie'

ds = Model(MODEL_FILE, N_FEATURES, N_CONTEXT, ALPHABET_FILE, BEAM_WIDTH)

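# Attach the language model and trie to the decoder.
ds.enableDecoderWithLM(ALPHABET_FILE, LANGUAGE_MODEL, TRIE_FILE, LM_WEIGHT,
                       VALID_WORD_COUNT_WEIGHT)

def process(path):
    # Read the sample rate and raw samples, then run speech-to-text.
    fs, audio = scipy.io.wavfile.read(path)
    processed_data = ds.stt(audio, fs)
    return processed_data

print(process('sample.wav'))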

This might produce the same response as the command line; use the same audio files for both when you verify. The audio files should be 16-bit, 16000 Hz, mono recordings.
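
One way to catch a format mismatch before running inference is to inspect the WAV header. The following is only a minimal sketch using Python's standard wave module; the function name check_wav_format and its default values are illustrative assumptions, not part of DeepSpeech.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1,
                     expected_width_bytes=2):
    """Return a list of format mismatches; an empty list means the file matches.

    Hypothetical helper: the defaults encode the 16-bit, 16000 Hz, mono
    requirement mentioned above.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        width = wf.getsampwidth()  # bytes per sample, so 2 means 16-bit
    if rate != expected_rate:
        problems.append("sample rate is %d Hz, expected %d Hz"
                        % (rate, expected_rate))
    if channels != expected_channels:
        problems.append("file has %d channel(s), expected %d"
                        % (channels, expected_channels))
    if width != expected_width_bytes:
        problems.append("sample width is %d bit, expected %d bit"
                        % (width * 8, expected_width_bytes * 8))
    return problems
```

Calling check_wav_format on a 48000 Hz recording would return a one-element list describing the sample rate mismatch, which is a cheap thing to check before suspecting the model or its parameters.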

Karthikeyan K
    Thanks, including the other files was a good tip, but the problem turned out to be that I was using 48000 Hz instead of 16000! – Display name Dec 15 '18 at 17:52
You should convert it to 16000 Hz; most issues with weird output come from an incorrect audio format. Loading the language model can also improve the WER.
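
As a rough illustration of the conversion, here is a sketch that downsamples a 16-bit mono WAV by an integer factor (for example 48000 Hz to 16000 Hz) using only the standard library. The function name downsample_wav is hypothetical, and averaging each group of samples is a crude stand-in for a proper low-pass filter, so a dedicated resampler such as sox is likely to give better quality.

```python
import array
import wave


def downsample_wav(src_path, dst_path, target_rate=16000):
    """Write a downsampled copy of a 16-bit mono WAV and return its frame count.

    Hypothetical helper: only handles source rates that are an integer
    multiple of target_rate, and averages each group of samples.
    """
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM input")
        src_rate = src.getframerate()
        if src_rate % target_rate != 0:
            raise ValueError("source rate must be a multiple of target rate")
        factor = src_rate // target_rate
        samples = array.array("h")  # signed 16-bit integers
        samples.frombytes(src.readframes(src.getnframes()))

    # Drop any trailing samples that do not fill a complete group.
    usable = len(samples) - len(samples) % factor
    averaged = array.array(
        "h",
        (sum(samples[i:i + factor]) // factor
         for i in range(0, usable, factor)),
    )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(averaged.tobytes())
    return len(averaged)
```

For instance, downsample_wav('testFile3.wav', 'testFile3_16k.wav') would produce a 16000 Hz copy of the file from the question, assuming it really is 16-bit mono at 48000 Hz; the output file name here is just an example.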