I'm using the MS Azure Speech-to-text service with Python.

My input is a byte string containing only a few seconds of audio. I expected the cloud service to stop processing once the end of the stream is reached and then return the recognized text. Instead, it takes about 5 minutes until the recognized event is triggered.

            speech_config = speechsdk.SpeechConfig(subscription=API_KEY,
                                                   region="westeurope",
                                                   speech_recognition_language='de-DE')
            stream = PushAudioInputStream(stream_format=
                                          AudioStreamFormat(samples_per_second=sample_rate, bits_per_sample=SAMPLE_WIDTH * 8,
                                                            compressed_stream_format=speechsdk.AudioStreamContainerFormat.FLAC))
            audio_input = speechsdk.AudioConfig(stream=stream)
            stream.write(data)
            speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
            speech_recognizer.start_continuous_recognition()
            done = False

            def stop_recognition(evt):
                logger.debug("Stopped MS Azure recognition: %s", evt)
                nonlocal done
                done = True

            def recognized(evt):
                logger.info("Recognized MS Azure transcript: %s", evt)
                nonlocal text
                text += " " + evt.result.text

            speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
            speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
            speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
            speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
            speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

            speech_recognizer.recognized.connect(recognized)
            speech_recognizer.session_stopped.connect(stop_recognition)
            speech_recognizer.canceled.connect(stop_recognition)
            while not done:
                time.sleep(.5)
            speech_recognizer.stop_continuous_recognition()

The log shows a delay of about 5 minutes between sending the audio (23:58:19) and the final result (00:03:26):

2022-11-13 23:58:19,504 - speech_processing.speech_recognition.speech_recognition - DEBUG - Sending 192000 bytes (6 sec) for recognition
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=50e5c478cdc34e0a8ced3867be493bc3, text="telefon", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=d1448833ac8f40ef9c1ebc4cae488bcd, text="telefonspeicher", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=cdf9f074c13b4a2c94960ec147db765c, text="telefon speichere", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=548133156bb44dc8ae08fd0848fa8ec5, text="telefon speichere als", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=c03970619f1e42278b2a2ef19ee4f1fe, text="telefon speichere als bärbel", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=ff5f6a18d1e4409cab2661582cb8a693, text="telefon speichere als bärbel 0", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=a3cb8f82c62b4235abc2fea2696342f8, text="telefon speichere als bärbel 03", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=cafc805031654aa4865a4fe1b742d1cd, text="telefon speichere als bärbel 038", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=69782c9485244e3191846b924adb3807, text="telefon speichere als bärbel 0385", reason=ResultReason.RecognizingSpeech))
RECOGNIZED: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=9d92890d52d84b7f926a6977d6324ca1, text="Telefon speichere als Bärbel 0385.", reason=ResultReason.RecognizedSpeech))
2022-11-14 00:03:26,487 - speech_processing.speech_recognition.speech_recognition - INFO - Recognized MS Azure transcript: SpeechRecognitionEventArgs(session_id=2e4c92f4fed6498f8f5260199bdcc5d7, result=SpeechRecognitionResult(result_id=9d92890d52d84b7f926a6977d6324ca1, text="Telefon speichere als Bärbel 0385.", reason=ResultReason.RecognizedSpeech))

1 Answer

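I found my error: the stream must be closed after the data is written. Without the close, the recognizer apparently gets no end-of-stream signal and keeps waiting for more audio.

            stream.write(data)
            stream.close()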
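For anyone who wants to see the fix in context, here is a minimal sketch of the whole flow, not a definitive implementation. The names `feed_and_close`, `transcribe_clip`, and `timeout_s` are my own and purely illustrative. It assumes the same FLAC push stream as in the question, connects the handlers before recognition starts, and uses a `threading.Event` in place of the polling loop.

```python
import threading


def feed_and_close(stream, data: bytes) -> None:
    """Write the whole clip, then close the stream to mark end-of-audio."""
    stream.write(data)
    stream.close()  # without this, the recognizer keeps waiting for more audio


def transcribe_clip(data: bytes, api_key: str, region: str = "westeurope",
                    language: str = "de-DE", timeout_s: float = 60.0) -> str:
    """Hypothetical helper: recognize one short FLAC clip held in memory."""
    # Imported lazily so feed_and_close() can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=api_key, region=region,
        speech_recognition_language=language)
    stream_format = speechsdk.audio.AudioStreamFormat(
        compressed_stream_format=speechsdk.AudioStreamContainerFormat.FLAC)
    stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config)

    parts = []                # final (not interim) results, in order
    done = threading.Event()  # set when the session stops or is canceled

    # Connect handlers before starting so no event is missed.
    recognizer.recognized.connect(lambda evt: parts.append(evt.result.text))
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    feed_and_close(stream, data)
    done.wait(timeout=timeout_s)  # assumed upper bound; adjust as needed
    recognizer.stop_continuous_recognition()
    return " ".join(parts)
```

Calling `transcribe_clip(data, API_KEY)` with the byte string from the question should then return shortly after the audio has been processed, rather than after several minutes.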