
Using Azure Speech Service, I'm trying to transcribe a bunch of WAV files (compressed in the PCMU, a.k.a. mu-law, format).

I came up with the following code based on the articles referenced below. It sometimes works fine on a few files, but I keep getting segmentation fault errors when looping over a bigger list of files (~50), and it never breaks on the same file (it could be the 2nd, 15th, or 27th).

Also, when running a subset of the files, the transcription results seem to be the same with or without the decompression part of the code, which makes me wonder whether the decompression method recommended by Microsoft works at all.

import azure.cognitiveservices.speech as speechsdk

def azurespeech_transcribe(audio_filename):
    # Pull-stream callback that feeds the raw bytes of the file to the SDK.
    class BinaryFileReaderCallback(speechsdk.audio.PullAudioInputStreamCallback):
        def __init__(self, filename: str):
            super().__init__()
            self._file_h = open(filename, "rb")

        def read(self, buffer: memoryview) -> int:
            # Fill the SDK-provided buffer; returning 0 signals end of stream.
            try:
                size = buffer.nbytes
                frames = self._file_h.read(size)
                buffer[:len(frames)] = frames
                return len(frames)
            except Exception as ex:
                print('Exception in `read`: {}'.format(ex))
                raise

        def close(self) -> None:
            try:
                self._file_h.close()
            except Exception as ex:
                print('Exception in `close`: {}'.format(ex))
                raise

    # Compressed (container) input format; on Linux this path relies on GStreamer.
    compressed_format = speechsdk.audio.AudioStreamFormat(
        compressed_stream_format=speechsdk.AudioStreamContainerFormat.MULAW
    )
    callback = BinaryFileReaderCallback(filename=audio_filename)
    stream = speechsdk.audio.PullAudioInputStream(
        stream_format=compressed_format,
        pull_stream_callback=callback
    )
    speech_config = speechsdk.SpeechConfig(
        subscription="<my_subscription_key>",
        region="<my_region>",
        speech_recognition_language="en-CA"
    )
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config
    )
    # recognize_once only transcribes the first utterance it detects.
    result = speech_recognizer.recognize_once()
    return result.text
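
For context, the function is called in a plain loop over the file list; the paths below are placeholders:

wav_files = ["/path/to/file1.wav", "/path/to/file2.wav"]  # ~50 files in practice

for path in wav_files:
    print(azurespeech_transcribe(path))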

The code is running on WSL.


I have already tried:

  • Logging a more meaningful error with the faulthandler module (see the snippet after this list)
  • Increasing the Python stack limit: resource.setrlimit(resource.RLIMIT_STACK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
  • Adding some sleep timers
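
The faulthandler attempt was just enabling it at the top of the script so a crash dumps the Python traceback:

import faulthandler

faulthandler.enable()  # dumps a Python traceback when the process gets SIGSEGV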

References:


2 Answers


I worked on a similar dataset and didn't get any segmentation fault. Check your subscription and deployment pattern along with the pricing tier. I also implemented the same flow with the custom speech-to-text model and it worked without segmentation faults.

  1. Check whether the pricing tier is what is causing the segmentation fault
  2. Check the subscription allowance
  3. Try training in Custom Speech Studio and testing there


The segmentation behavior differs from region to region and between pricing tiers.


After running the code, I didn't get any segmentation error, as the pricing tier was suitable for the volume of the data.

Sairam Tadepalli

From Speech SDK version 1.24.0 onwards, you can stream ALAW/MULAW encoded data directly to the Speech service (without the need for GStreamer) by using AudioStreamWaveFormat (https://learn.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.audiostreamwaveformat?view=azure-python). This way there is less complexity involved (no GStreamer).

import azure.cognitiveservices.speech as msspeech

encoded_format = msspeech.audio.AudioStreamFormat(
    samples_per_second=16000,
    bits_per_sample=16,
    channels=1,
    wave_stream_format=msspeech.AudioStreamWaveFormat.MULAW
)
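
For completeness, here is a minimal sketch (not the exact code from the answer) of wiring this wave format into recognition via a push stream; the sample rate, bit depth, file path, and key/region below are placeholder assumptions to adjust to your audio (telephony mu-law is commonly 8 kHz, 8-bit):

import azure.cognitiveservices.speech as msspeech

# Wave format describing the raw mu-law payload; adjust the parameters to your files.
encoded_format = msspeech.audio.AudioStreamFormat(
    samples_per_second=8000,
    bits_per_sample=8,
    channels=1,
    wave_stream_format=msspeech.AudioStreamWaveFormat.MULAW
)

push_stream = msspeech.audio.PushAudioInputStream(stream_format=encoded_format)
audio_config = msspeech.audio.AudioConfig(stream=push_stream)
speech_config = msspeech.SpeechConfig(subscription="<key>", region="<region>")
recognizer = msspeech.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Feed the audio bytes, then close the stream so recognition can finish.
with open("<audio.wav>", "rb") as f:
    push_stream.write(f.read())
push_stream.close()

result = recognizer.recognize_once()
print(result.text)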
jhakulin