I'm using the excellent deepspeech package to transcribe an audio file in Python. Here's my quick implementation:
import wave

import deepspeech
import numpy as np

# Load the pre-trained acoustic model
model_file_path = 'deepspeech-0.9.3-models.pbmm'
model = deepspeech.Model(model_file_path)

# Read the whole WAV file into an int16 NumPy array
filename = 'podcast.wav'
w = wave.open(filename, 'rb')
frames = w.getnframes()
buffer = w.readframes(frames)
w.close()
data16 = np.frombuffer(buffer, dtype=np.int16)

# Run speech-to-text over the entire buffer in one call
text = model.stt(data16)
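One thing worth ruling out first (this is an assumption, not something I've confirmed for your environment): if the plain CPU deepspeech wheel is installed alongside deepspeech-gpu, both provide the same deepspeech module, and the import may resolve to the CPU build. A quick standard-library check of which distributions are actually present:

```python
from importlib import metadata

def installed_versions(dists=("deepspeech", "deepspeech-gpu")):
    """Report which DeepSpeech distributions pip has installed, if any."""
    versions = {}
    for dist in dists:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            versions[dist] = None  # distribution not installed
    return versions

print(installed_versions())
```

If both show up, uninstalling the CPU wheel and reinstalling deepspeech-gpu would be the thing to try.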
podcast.wav is a ~20-minute audio file. The call text = model.stt(data16) takes 10+ minutes (I interrupted the process after 10 minutes), which is unexpectedly slow given that a GPU is available (I'm using Google Colab). I suspect the script isn't using the GPU. Is there another way to write the above code that ensures it runs on the GPU? I can confirm that deepspeech-gpu is installed.
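Independent of the GPU question, transcribing in fixed-duration chunks makes progress observable instead of one opaque 10-minute call. The splitting itself is plain NumPy (chunk_audio is a hypothetical helper, and the 16 kHz sample rate is assumed from the DeepSpeech 0.9.3 model; note that naive cuts can land mid-word, so DeepSpeech's streaming API is likely the better fit for production use):

```python
import numpy as np

def chunk_audio(data16, sample_rate=16000, chunk_seconds=30):
    """Split an int16 PCM buffer into fixed-duration chunks.

    The last chunk may be shorter than chunk_seconds.
    """
    samples_per_chunk = sample_rate * chunk_seconds
    return [data16[i:i + samples_per_chunk]
            for i in range(0, len(data16), samples_per_chunk)]

# Hypothetical usage with the model from the question:
# text = " ".join(model.stt(chunk) for chunk in chunk_audio(data16))
```

Even if the GPU turns out to be in use, this makes it easy to time a single chunk and extrapolate, which tells you whether the full run is merely long or actually stuck.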