I am writing a real-time speech-to-text program. I am using DeepSpeech for STT and sounddevice for microphone capture. However, the words transcribed from the audio seem to be "stretched" and are not recognized accurately. For example, when I say "testing", the output is "e te te te ing".

Below is part of my code. I would like to know what the problem is and how to solve it. Thanks.

import deepspeech
import sounddevice as sd
import numpy as np

ds_model = deepspeech.Model("C:/Users/somthing else/deepspeech-0.9.3-models.pbmm")

    def microphone_input(argument):
        audio = sd.rec(int(3 * 16000), samplerate=16000, channels=1)
        audio_data_int16 = audio.astype(np.int16)
        return audio_data_int16

    def output(self):
        audio = self.microphone_input()
        text = ds_model.stt(audio)
        print("Transcribed Text:", text)

I've tried the following, and none of it helped: changing the sample rate to 48000 (the result was even worse), scaling the audio data to the range -1.0 to 1.0, and recording stereo audio.
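Three things in the snippet above look like plausible contributors, though none is confirmed as the cause of the repeated syllables. `sd.rec` returns immediately unless it is told to block, so the buffer may be read before it is filled. It records `float32` samples in the range -1.0 to 1.0 by default, so a plain `astype(np.int16)` truncates almost every sample to 0. It also returns an array of shape `(frames, channels)`, while `Model.stt` expects a one-dimensional array of 16-bit mono samples at the model's sample rate (16000 Hz for the 0.9.3 models). The following is a minimal sketch under those assumptions; `float_to_int16` and `record_int16` are hypothetical helper names, not part of either library.

```python
import numpy as np


def float_to_int16(audio):
    """Convert float samples in [-1.0, 1.0] to a 1-D mono int16 array."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Collapse (frames, channels) to (frames,) by averaging the channels.
        audio = audio.mean(axis=1)
    # Clip first so out-of-range samples cannot wrap around after the cast.
    audio = np.clip(audio, -1.0, 1.0)
    return (audio * 32767).astype(np.int16)


def record_int16(seconds=3, samplerate=16000):
    """Hypothetical helper: record mono int16 audio and block until it is done."""
    # Imported here so float_to_int16 can be used without audio hardware.
    import sounddevice as sd

    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate,
                   channels=1, dtype="int16")
    sd.wait()  # sd.rec is non-blocking by default; wait for the recording to finish
    return audio.flatten()  # (frames, 1) -> (frames,)


# A plain cast truncates float samples toward zero; scaling first keeps the signal.
samples = np.array([[0.5], [-0.25], [0.9]], dtype=np.float32)
naive = samples.astype(np.int16)
scaled = float_to_int16(samples)
print(naive.ravel(), scaled)
```

With a helper like this, the transcription call would become `ds_model.stt(record_int16())`. Requesting `dtype="int16"` directly from `sd.rec` avoids the conversion step entirely; `float_to_int16` is only needed if the audio has already been captured as floats.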
