I am synthesising text using Azure Speech Service's TTS. When setting the audio config, I want to disable the playback of the audio. Per the documentation, AudioOutputConfig's use_default_speaker keyword is False by default. Hence, the following code should work:

import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('SPEECH_KEY'),
    region=os.environ.get('SPEECH_REGION')
)
audio_config = speechsdk.audio.AudioOutputConfig()

but I get the following error:

ValueError: default speaker needs to be explicitly activated

The same goes if I explicitly set use_default_speaker=False. The only way I can get the code to run is to set use_default_speaker=True, but then the audio is played through the computer's speakers, which is annoying and time-consuming when generating multiple samples.

I also tried experimenting with the stream keyword, but I can't figure out what to set it to.

I don't want to write the data to a WAV file using the filename keyword.

Does anyone know how I can turn off the behaviour of playing back the audio?

jutta

2 Answers

I found out by trial and error using different options from the Azure documentation, though they weren't particularly helpful. It turns out you can use PullAudioOutputStream() as your audio config:

import azure.cognitiveservices.speech as speechsdk
import os

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('SPEECH_KEY'),
    region=os.environ.get('SPEECH_REGION')
    )
audio_config = speechsdk.audio.PullAudioOutputStream() # Change here

speech_synthesiser = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

xml_str = """<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="sv-SE"><voice name="sv-SE-SofieNeural">Hej</voice></speak>"""
speech_synthesis_result = speech_synthesiser.speak_ssml(xml_str)
audio_bytes = speech_synthesis_result.audio_data[44:]  # strip the 44-byte RIFF header
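
An alternative sketch, assuming the SDK behaves as its documentation describes: passing audio_config=None to SpeechSynthesizer should suppress playback entirely, so the audio is only available through the result's audio_data. The function name synthesize_to_bytes is hypothetical, and the SDK import is deferred so the snippet loads even where the package is not installed.

```python
import os


def synthesize_to_bytes(text: str) -> bytes:
    """Synthesise `text` and return the WAV bytes without playing them."""
    # Deferred import: the SDK is only needed when the function is called.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["SPEECH_KEY"],
        region=os.environ["SPEECH_REGION"],
    )
    # Assumption: audio_config=None means no speaker and no file output.
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=None
    )
    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"synthesis failed: {result.reason}")
    return result.audio_data  # includes the RIFF header
```

Calling synthesize_to_bytes("Hej") needs SPEECH_KEY and SPEECH_REGION set in the environment, as in the question.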

A heads up: you may want to remove the RIFF header if you want to stitch together multiple audio bytearrays without introducing click noises.
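
A minimal sketch of that stitching step, using only the standard library. It assumes every chunk starts with the canonical 44-byte PCM WAV header and that all chunks are 16 kHz, 16-bit mono (which I believe is the SDK's default output format, but check your speech_config). The name stitch_pcm and its parameters are made up for illustration.

```python
import io
import wave

HEADER_SIZE = 44  # canonical PCM WAV header length (assumption)


def stitch_pcm(wav_chunks, sample_rate=16000, sample_width=2, channels=1):
    """Strip each chunk's header, join the raw PCM, and write one new header."""
    pcm = b"".join(chunk[HEADER_SIZE:] for chunk in wav_chunks)
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)  # bytes per sample
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return out.getvalue()
```

The returned bytes are a single WAV file, so they can be written to disk or handed to an audio library as-is. If the output format is changed to something other than plain PCM, the fixed 44-byte slice no longer applies.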

jutta

I tried this in my environment and got the results below.

Initially, I got the same error as you with the following code:

import azure.cognitiveservices.speech as speechsdk
import os

speech_key = "<Your_key>"
speech_region = "<Your_region>"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
audio_config = speechsdk.audio.AudioOutputConfig()


I changed the code as follows to get past the error:

import azure.cognitiveservices.speech as speechsdk
import time

speech_config = speechsdk.SpeechConfig(subscription="<Your_key>", region="<Your_region>")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello, World!").get()

time.sleep(1)

synthesizer.stop_speaking()


With this, I was able to stop the audio playback for the generated samples.

Dasari Kamali
  • Thank you for your response, but this wasn't exactly what I was looking for. I don't want to have to stop it manually, but rather just get the data returned immediately. – jutta Mar 23 '23 at 10:28
  • Which format of data do you want exactly? – Dasari Kamali Mar 23 '23 at 10:42