
I'm using the Azure Speech to Text API to transcribe short spoken recordings, from 10 seconds to 1 minute long. Each recognition takes around 5 seconds to complete, which seems like too much!

Here is how I do it:

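import azure.cognitiveservices.speech as speechsdk

# speech_key, service_region, language and filepath are defined elsewhere
speech_config = speechsdk.SpeechConfig(subscription=speech_key,
                                       region=service_region,
                                       speech_recognition_language=language)
audio_config = speechsdk.audio.AudioConfig(filename=filepath)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                               audio_config=audio_config)

# Blocks until the recognition result is returned
result = speech_recognizer.recognize_once()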

I tried to identify the bottleneck, using timeit:

import timeit

print(timeit.timeit(lambda: speechsdk.SpeechConfig(subscription=speech_key,
                                                   region=service_region,
                                                   speech_recognition_language=language),
                    number=100))
>>> 0.004
print(timeit.timeit(lambda: speechsdk.audio.AudioConfig(filename=filepath),
                    number=100))
>>> 0.003
print(timeit.timeit(lambda: speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                       audio_config=audio_config),
                    number=100))
>>> 0.118

print(timeit.timeit(lambda: print(speech_recognizer.recognize_once()),
                    number=5))  # Only doing this 5 times because it's very slow
>>> 35.01

In the last measurement I actually used a wrapper function that reinitializes speech_recognizer on every run, because the recognizer is no longer usable after recognize_once() has been called on it.
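For illustration, a minimal sketch of such a wrapper is given here. It is not the exact code used for the numbers in this question: time_recognition and make_recognizer are hypothetical names, and the sketch only assumes that the object returned by the factory exposes recognize_once(). Building the recognizer happens outside the timed region, so only the recognition call itself is measured.

```python
import time


def time_recognition(make_recognizer, runs=5):
    """Return the average seconds spent in recognize_once() over `runs` runs.

    make_recognizer is a hypothetical zero-argument factory that returns a
    fresh recognizer each time, since a recognizer is not reused here.
    """
    durations = []
    for _ in range(runs):
        recognizer = make_recognizer()  # construction is not timed
        start = time.perf_counter()
        recognizer.recognize_once()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Assumed usage with the objects from the question:
# average = time_recognition(
#     lambda: speechsdk.SpeechRecognizer(speech_config=speech_config,
#                                        audio_config=audio_config))
```

Separating construction from recognition this way makes it easier to see that almost all of the time is spent inside recognize_once().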

In this experiment it takes around 7 seconds to transcribe one 11-second recording.

I am transcribing audio files in French, using service_region = "westeurope".

Be Chiller Too

1 Answer


If the audio is 10 s long and recognition takes 5 s, that still seems reasonable: the real-time factor (RTF) is 5/10 = 0.5.

Speech recognition is a heavy process; the algorithm and the model need time to run.
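The RTF figure used in this answer is simply processing time divided by audio duration. A small sketch of that arithmetic, where real_time_factor is a hypothetical helper name and the inputs are the timings quoted in the question and answer:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


print(real_time_factor(5, 10))            # 0.5
print(round(real_time_factor(7, 11), 2))  # 0.64
```

Both values are below 1.0, meaning the audio is transcribed faster than it would take to play back.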

Sheng