I'm using the Azure Speech to Text API to recognize short spoken recordings, from 10 seconds to 1 minute long. Each recognition takes around 5 seconds to complete, which seems like too much!
Here is how I do it:
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=speech_key,
                                       region=service_region,
                                       speech_recognition_language=language)
audio_config = speechsdk.audio.AudioConfig(filename=filepath)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                               audio_config=audio_config)
result = speech_recognizer.recognize_once()
I tried to identify the bottleneck using timeit:
print(timeit.timeit(lambda: speechsdk.SpeechConfig(subscription=speech_key,
                                                   region=service_region,
                                                   speech_recognition_language=language),
                    number=100))
>>> 0.004
print(timeit.timeit(lambda: speechsdk.audio.AudioConfig(filename=filepath),
                    number=100))
>>> 0.003
print(timeit.timeit(lambda: speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                       audio_config=audio_config),
                    number=100))
>>> 0.118
print(timeit.timeit(lambda: print(speech_recognizer.recognize_once()),
                    number=5))  # Only doing this 5 times because it's very slow
>>> 35.01
I actually used a wrapper function to reinitialize the speech_recognizer before each call, because calling recognize_once() on it makes it unusable afterwards.
In this experiment, each recognition takes around 7 seconds (35.01 s over 5 runs) for one 11-second recording.
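For reference, the wrapper is roughly along these lines. This is a minimal sketch, not my exact code: the names time_each_call and make_speech_recognizer are made up for illustration, and it assumes the recognizer is rebuilt outside the timed region so that only recognize_once() is measured.

```python
import time
from typing import Any, Callable, List


def time_each_call(make_recognizer: Callable[[], Any],
                   recognize: Callable[[Any], Any],
                   runs: int = 5) -> List[float]:
    """Return one duration (in seconds) per run.

    A fresh recognizer is built before every run, and only the
    recognize() call itself is timed.
    """
    durations = []
    for _ in range(runs):
        recognizer = make_recognizer()   # setup, not timed
        start = time.perf_counter()
        recognize(recognizer)            # only this call is timed
        durations.append(time.perf_counter() - start)
    return durations


def make_speech_recognizer(speech_key, service_region, language, filepath):
    """Hypothetical factory: builds a new recognizer for one audio file."""
    # Imported here so the timing helper has no Azure dependency.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=speech_key,
                                           region=service_region,
                                           speech_recognition_language=language)
    audio_config = speechsdk.audio.AudioConfig(filename=filepath)
    return speechsdk.SpeechRecognizer(speech_config=speech_config,
                                      audio_config=audio_config)


# Usage (needs a valid key and an audio file, so it is left commented out):
# durations = time_each_call(
#     lambda: make_speech_recognizer(speech_key, service_region, language, filepath),
#     lambda recognizer: recognizer.recognize_once(),
# )
# print(sum(durations) / len(durations))
```

Timing each run separately, rather than taking one total from timeit, also shows whether the first call is slower than the following ones.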
I am transcribing French audio files, using service_region = "westeurope".