Why do the results of this DeepSpeech Python program differ from the results I get from the command line interface?

Question

I'm learning about Mozilla's DeepSpeech Speech-To-Text engine. I had no trouble getting the command line interface working, but the Python interface seems to be behaving differently. When I run:

deepspeech --model models/output_graph.pb --alphabet models/alphabet.txt --audio testFile3.wav

On a 16-bit, mono, 48000 Hz PCM .wav file generated with sox, I get the following:

test test apple benana

Apart from "benana" where I said "banana", it seems to work fine, as it does on the other files I've tested. The problem comes when I use the following code, which comes from this tutorial:

import deepspeech
import scipy.io.wavfile as wav
import sys

ds=deepspeech.Model(sys.argv[1],26,9,sys.argv[2],500)
fs,audio=wav.read(sys.argv[3])
processed_data=ds.stt(audio,fs)

print(processed_data)

I run the code with the following command:

python3 -Bi test.py models/output_graph.pb models/alphabet.txt testFile3.wav

Depending on the specific file, I get different four-character responses. The response I got from this particular file was 'hahm', but 'hmhm' and ' eo' are also common. Changing the parameters to the model (the 26, 9, and 500) doesn't seem to change the output.

Karthikeyan K · Accepted Answer · 2018-12-15T06:51:06.173

Just include your trie and lm.binary files and try again.

from deepspeech import Model
import scipy.io.wavfile

BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26
N_CONTEXT = 9
MODEL_FILE = 'output_graph.pbmm'
ALPHABET_FILE = 'alphabet.txt'
LANGUAGE_MODEL = 'lm.binary'
TRIE_FILE = 'trie'

ds = Model(MODEL_FILE, N_FEATURES, N_CONTEXT, ALPHABET_FILE, BEAM_WIDTH)

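# Enable the language-model decoder (needs lm.binary and the trie file).
ds.enableDecoderWithLM(ALPHABET_FILE, LANGUAGE_MODEL, TRIE_FILE, LM_WEIGHT,
                       VALID_WORD_COUNT_WEIGHT)

def process(path):
    # The file should be a 16-bit, mono, 16000 Hz recording.
    fs, audio = scipy.io.wavfile.read(path)
    return ds.stt(audio, fs)

# Print the transcript; the bare call would discard it when run as a script.
print(process('sample.wav'))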

With these changes the output may match the command line. Use the same audio file for both runs to verify. The audio files should be 16-bit, 16000 Hz, mono recordings.
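Since the format requirement above is easy to miss, here is a minimal sketch of a pre-flight check that reads only the WAV header with the standard-library wave module. The helper name wav_format_problems and the constants are hypothetical, introduced for illustration; they are not part of DeepSpeech.

```python
import wave

# Format recommended in the answer above: 16-bit, mono, 16000 Hz PCM.
EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16 bits


def wav_format_problems(path):
    """Return a list of format mismatches; the list is empty if the file looks fine.

    Hypothetical helper for illustration, not part of the DeepSpeech API.
    """
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        width = wav_file.getsampwidth()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"samples are {8 * width}-bit, expected {8 * EXPECTED_SAMPLE_WIDTH}-bit")
    return problems
```

Calling wav_format_problems on a 48000 Hz file like the one in the question would report the sample-rate mismatch before any audio is passed to ds.stt.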

Thanks, including the other files was a good tip, but the problem turned out to be that I was using 48000 Hz instead of 16000! — Display name, Dec 15 '18 at 17:52

score 2 · Answer 2 · answered Dec 14 '18 at 23:57

You should convert the file to 16000 Hz; most issues with weird output come from an incorrect audio format. Loading the language model can also improve the word error rate (WER).
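One way to do that conversion in Python, rather than regenerating the file, is sketched below. It assumes NumPy and SciPy are installed and that the audio is a mono int16 array such as the one returned by scipy.io.wavfile.read; resample_to_16k is a hypothetical helper name, and polyphase resampling is just one reasonable choice of method.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16000  # sample rate recommended in the answers


def resample_to_16k(audio, source_rate):
    """Resample mono int16 samples to 16000 Hz with a polyphase filter.

    Hypothetical helper for illustration, not part of the DeepSpeech API.
    """
    if source_rate == TARGET_RATE:
        return audio
    # Reduce the rate ratio to lowest terms, e.g. 48000 -> 16000 gives up=1, down=3.
    factor = gcd(TARGET_RATE, source_rate)
    up, down = TARGET_RATE // factor, source_rate // factor
    resampled = resample_poly(audio.astype(np.float64), up, down)
    # Clip before casting so filter overshoot cannot wrap around in int16.
    return np.clip(np.round(resampled), -32768, 32767).astype(np.int16)
```

With the question's script, this would be applied between reading and decoding: read fs and audio from the file, pass both to resample_to_16k, then call ds.stt with the result and 16000.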

Carlos Fonseca Murillo
