
I am trying to create a speech-to-text script that transcribes audio in real time from a .raw file that a web socket server is continuously appending audio data to, and converts it to text for further processing. I am very new to programming and would like working Python code to use as an example, or at least some kind of guide to go off of.

I have installed the Speech SDK and everything else needed for this, but simply don't know how to do it. The only remotely relevant example I have is the speech_recognition_with_pull_stream section in https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/python/console/speech_sample.py, but I don't know where I'm supposed to pass in the file.

Yuuki
  • What is a .raw file in an audio context? Do you mean some sort of WAV file? – Marcel Feb 21 '23 at 06:53
  • @Marcel I have implemented a plugin in my Asterisk setup that transmits data to a server over a web socket and writes the data to a .raw file it creates. The issue I'm facing is that if the person talks for, say, 2 minutes, you have to wait 2 minutes for the STT to transcribe the speech to text. – Yuuki Feb 21 '23 at 08:37

1 Answer


I created one Speech resource on my Azure Portal:-

Visit your Azure Portal > Create a resource > search for Speech and click Create. I created a Speech service with the Standard S0 tier, but you can create it with the Free F0 tier too.


Now, visit Keys and Endpoint in the left pane of your Speech service, under Resource Management. Copy one of the keys (Key 1 or Key 2) and also copy the Location/Region, then save them as environment variables in your terminal (for example, from VS Code) like below:-

(Screenshot: Keys and Endpoint page of the Speech resource)

Set SPEECH_KEY and SPEECH_REGION as environment variables in your terminal like below:-

setx SPEECH_KEY your-key
setx SPEECH_REGION your-region
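Note that setx only affects terminals opened after you run it, so open a new terminal (or additionally run set SPEECH_KEY=your-key in cmd for the current session) before running the code. On Linux or macOS, the equivalent would be:

export SPEECH_KEY=your-key
export SPEECH_REGION=your-region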


Method 1) Convert speech to text with your local machine's microphone; you can use this to integrate with your real-time audio:-

import os

import azure.cognitiveservices.speech as speechsdk


def recognize_from_microphone():
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    speech_config.speech_recognition_language = "en-US"

    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    print("Speak into your microphone.")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")


recognize_from_microphone()

Output:-

It first prompts Speak into your microphone; you speak a sentence and it is converted to text and printed as Recognized: <your sentence>.
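A note on the sample above: recognize_once_async() returns after a single utterance (it stops listening at the first pause), so it will not keep transcribing ongoing speech. For the real-time use case in the question, continuous recognition is the usual approach; here is a minimal sketch with the same microphone input and the same SPEECH_KEY/SPEECH_REGION assumptions:

import os
import time

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Print each phrase as soon as the service finalizes it, instead of waiting for the whole recording.
speech_recognizer.recognized.connect(lambda evt: print("Recognized: {}".format(evt.result.text)))

speech_recognizer.start_continuous_recognition()
print("Speak into your microphone (Ctrl+C to stop).")
try:
    while True:
        time.sleep(0.5)
except KeyboardInterrupt:
    speech_recognizer.stop_continuous_recognition()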

Method 2) Using an audio file to convert to text:-

I added one audio.wav file in the same folder as my code and referenced it with audio_config = speechsdk.audio.AudioConfig(filename="audio.wav"), then ran the code and got the output from my audio file in text form below:-

import os

import azure.cognitiveservices.speech as speechsdk


def recognize_from_file():
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    speech_config.speech_recognition_language = "en-US"

    audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")


recognize_from_file()

Output:-

The transcription of audio.wav is printed in the same Recognized: ... format as above.

By default, only WAV files containing 16 kHz or 8 kHz, 16-bit, mono PCM audio are supported. With GStreamer installed, you can also use the compressed formats MP3, OPUS/OGG, FLAC, ALAW in WAV container, MULAW in WAV container, and ANY (for an MP4 container or unknown media format); see the references below.
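Since your audio already arrives as raw PCM over the web socket, you do not have to wait for a finished file at all: you can push bytes into the recognizer as they arrive and use continuous recognition. Below is a minimal sketch, assuming the .raw file contains 8 kHz, 16-bit, mono PCM (a common Asterisk format; adjust AudioStreamFormat if yours differs) and that audio.raw is a placeholder name for the file your server keeps appending to:

import os
import time

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
speech_config.speech_recognition_language = "en-US"

# Describe the raw PCM format; adjust to match what your Asterisk plugin writes.
stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=8000, bits_per_sample=16, channels=1)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Print each phrase as soon as it is finalized, instead of waiting for the whole call to end.
speech_recognizer.recognized.connect(lambda evt: print("Recognized: {}".format(evt.result.text)))
speech_recognizer.start_continuous_recognition()

# Tail the growing .raw file and feed new bytes to the recognizer.
with open("audio.raw", "rb") as f:  # hypothetical path to the file your server appends to
    while True:
        chunk = f.read(3200)  # roughly 0.2 s of 8 kHz 16-bit mono audio
        if chunk:
            push_stream.write(chunk)
        else:
            time.sleep(0.1)  # no new data yet; wait for the web socket server

If you prefer the pull model from the speech_recognition_with_pull_stream sample you linked, the same AudioStreamFormat applies; there you subclass speechsdk.audio.PullAudioInputStreamCallback and implement its read() method to fill the SDK's buffer from your file instead of pushing.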

References:-

Speech-to-text quickstart - Speech service - Azure Cognitive Services | Microsoft Learn

How to use compressed input audio - Speech service - Azure Cognitive Services | Microsoft Learn

SiddheshDesai