I am writing a real-time speech-to-text program. I am using DeepSpeech for STT and sounddevice for microphone capture. However, the words from the audio seem to be "stretched" and are not recognized accurately. For example, when I say "testing", the output is "e te te te ing".
Below is part of my code. I would really like to know what the problem is and how to solve it. Thanks.
import deepspeech
import sounddevice as sd
import numpy as np

ds_model = deepspeech.Model("C:/Users/somthing else/deepspeech-0.9.3-models.pbmm")

def microphone_input(argument):
    audio = sd.rec(int(3 * 16000), samplerate=16000, channels=1)
    audio_data_int16 = audio.astype(np.int16)
    return audio_data_int16

def output(self):
    audio = self.microphone_input()
    text = ds_model.stt(audio)
    print("Transcribed Text:", text)
What I've tried so far:
- Changing the sample rate to 48000, which made the result even worse.
- Scaling the audio data to the range -1.0 to 1.0, which didn't help.
- Recording stereo audio, which also didn't help.
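
For context on the data format: `sd.rec` returns a float32 array of shape (frames, channels) by default and returns before the recording has finished, while `Model.stt` expects a one-dimensional int16 buffer of mono audio at the model's sample rate. Casting float samples in the range -1.0 to 1.0 straight to int16 truncates almost all of them to zero, so the conversion step may be worth checking. Below is a minimal sketch of that conversion using only NumPy; the helper name `to_deepspeech_buffer` is hypothetical and not part of either library, and this is only one possible cause, not a confirmed diagnosis.

```python
import numpy as np


def to_deepspeech_buffer(recording):
    """Convert a (frames, channels) recording to a 1-D int16 array.

    Hypothetical helper for illustration; assumes float input is
    normalized to the range -1.0..1.0.
    """
    samples = np.asarray(recording)
    if samples.ndim == 2:
        # Keep only the first channel so the result is mono and 1-D.
        samples = samples[:, 0]
    if np.issubdtype(samples.dtype, np.floating):
        # Scale normalized floats up to the int16 range before casting.
        samples = np.clip(samples, -1.0, 1.0)
        samples = (samples * 32767).astype(np.int16)
    return np.ascontiguousarray(samples, dtype=np.int16)


# Small demonstration with synthetic samples instead of a microphone.
raw = np.array([[0.5], [-0.25], [0.0]], dtype=np.float32)
naive = raw.astype(np.int16)          # direct cast, as in the code above
converted = to_deepspeech_buffer(raw)  # scaled, flattened int16 samples
print(naive.ravel(), converted)
```

With sounddevice itself, an alternative to converting by hand would be to pass `dtype='int16'` to `sd.rec` and call `sd.wait()` before reading the array, so the buffer is already complete and in 16-bit form.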