1

I am able to generate a wav file of "Mary had a little lamb" using the code below, but it fails when I try to generate an mp3.

#https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?tabs=script%2Cwindowsinstall&pivots=programming-language-python

import azure.cognitiveservices.speech as speechsdk

languageCode = 'en-US'
ssmlGender = 'MALE'
voicName = 'en-US-JennyNeural'
speakingRate = '-5%'
pitch = '-10%'
voiceStyle = 'newscast'

azureKey = 'FAKE KEY'
azureRegion = 'FAKE REGION'

#############################################################
#audioOuputFile = './audioFiles/test.wav'
audioOuputFile = './audioFiles/test.mp3'
#############################################################

txt = 'Mary had a little lamb it\'s fleece was white as snow.'
txt+= 'And everywhere that Mary went, the lamb was sure to go,'
txt+= 'It followed her to school one day,'
txt+= 'That was against the rule,'
txt+= 'It made the children laugh and play,'
txt+= 'To see a lamb at school.'

head1 = f'<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="{languageCode}">'
head2 = f'<voice name="{voicName}">'
head3 =f'<mstts:express-as style="{voiceStyle}">'
head4 = f'<prosody rate="{speakingRate}" pitch="{pitch}">'
tail= '</prosody></mstts:express-as></voice></speak>'

ssml = head1 + head2 + head3 + head4 + txt + tail
print('this is the ssml======================================')
print(ssml)
print('end ssml======================================')
print()

speech_config = speechsdk.SpeechConfig(subscription=azureKey, region=azureRegion)
audio_config = speechsdk.AudioConfig(filename=audioOuputFile)

#HERE IS THE PROBLEM
#Without this statement everything works fine
#Can produce a wav file 
speech_config.set_speech_synthesis_output_format(SpeechSynthesisOutputFormat["Audio16Khz128KBitRateMonoMp3"])

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
synthesizer.speak_ssml_async(ssml)

Here is the console output:

(envo) D:\py_new\tts>python ttsTest3.py
this is the ssml======================================
<mstts:express-as style="newscast">Mary had a little lamb it's fleece was white as snow.And everywhere that Mary went, the lamb was sure to go,It followed her to school one day,That was against the rule,It made the children laugh and play,To see a lamb at school.</mstts:express-as>
end ssml======================================

Traceback (most recent call last):
  File "D:\py_new\tts\ttsTest3.py", line 45, in <module>
    speech_config.set_speech_synthesis_output_format(SpeechSynthesisOutputFormat["Audio16Khz128KBitRateMonoMp3"])
NameError: name 'SpeechSynthesisOutputFormat' is not defined

(envo) D:\py_new\tts>

Note the error: NameError: name 'SpeechSynthesisOutputFormat' is not defined

Compare with the "Customize audio format" section at:

https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?tabs=script%2Cwindowsinstall&pivots=programming-language-python

It all works fine in Node.js, but I need to be able to do it in Python as well.

user3567761
2 Answers

3

Try this (the enum must be qualified with the `speechsdk` module, which is why the bare name raised `NameError`):

speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3)
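To pick the enum member from the output filename instead of hard-coding it, a small helper can map extensions to the member *names* of `speechsdk.SpeechSynthesisOutputFormat`. This is a sketch, not part of either answer; the two member names below are real members of that enum, but the mapping itself is an illustrative choice:

```python
import os

# Member names of speechsdk.SpeechSynthesisOutputFormat, keyed by extension.
# The NameError in the question came from using the bare class name; the
# enum lives on the speechsdk module and must be qualified through it.
FORMAT_BY_EXT = {
    ".wav": "Riff24Khz16BitMonoPcm",
    ".mp3": "Audio16Khz32KBitRateMonoMp3",
}

def output_format_name(path):
    """Return the SpeechSynthesisOutputFormat member name for a filename."""
    ext = os.path.splitext(path)[1].lower()
    return FORMAT_BY_EXT[ext]
```

Usage would then be `speech_config.set_speech_synthesis_output_format(getattr(speechsdk.SpeechSynthesisOutputFormat, output_format_name(audioOuputFile)))`.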
Tharun K
0

You need to configure your audio output, something like this:

from azure.cognitiveservices.speech import SpeechConfig, SpeechSynthesizer
from azure.cognitiveservices.speech.audio import AudioOutputConfig

def hindi_text_to_speech_azure(hindi_text):
    speech_config = SpeechConfig(subscription=SPEECH_KEY, region=LOCATION_AREA)
    # Note: if only the language is set, the default voice of that language is chosen.
    speech_config.speech_synthesis_language = LANGUAGE_LOCATION_HINDI  # e.g. "hi-IN"
    # The voice setting will overwrite the language setting.
    # The voice setting will not overwrite the voice element in input SSML.
    speech_config.speech_synthesis_voice_name = MALE_VOICE_NAME_HINDI

    audio_config = AudioOutputConfig(
        filename="{name}.mp3".format(name=hindi_text[:30]))

    synthesizer = SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config)
    synthesizer.speak_text_async(hindi_text)

try this.
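A quick way to check what a file produced this way actually contains is to look at its magic bytes: a RIFF/WAVE header means the data is wav regardless of the `.mp3` extension. A minimal sketch (header offsets per the RIFF container layout):

```python
def looks_like_wav(first_bytes):
    # A RIFF/WAVE container starts with b"RIFF" at offset 0 and b"WAVE"
    # at offset 8; the four bytes in between hold the chunk size.
    return first_bytes[:4] == b"RIFF" and first_bytes[8:12] == b"WAVE"
```

For example: `with open("test.mp3", "rb") as f: print(looks_like_wav(f.read(12)))`.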

One thing I'm still stuck on, though (not really an issue with the code itself): the file is saved locally, but I want to upload it straight to the server (default storage) instead of local storage. Do you know how to do that?

Rohit Singh
  • Hi Rohit, when you do that you get a file with an mp3 extension, but it's actually a wav file. The problem with a wav file, of course, is that it takes up too much bandwidth. I think I may just use ffmpeg to convert the output to whatever format the customer wants. Maybe OGG. – user3567761 Jan 08 '22 at 20:32
  • I don't know about that. Thanks – Rohit Singh Jan 09 '22 at 09:27
  • 1
    @user3567761 so did you get an answer? My audio output is also too large: only 7 seconds of audio consumes 300 KB – Rohit Singh Jan 13 '22 at 09:39
  • Hi Rohit. I went a different route. I generate wav files and then use ffmpeg to convert them to mp3. For my particular workflow that works well. – user3567761 Jan 26 '22 at 23:08
  • You need to set the output format to mp3, e.g. with go I think this works: `err = speechConfig.SetOutputFormat(common.OutputFormat(audio.MP3))` – jbrown Jul 25 '22 at 06:13
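The wav-then-convert route mentioned in the comments can be sketched with ffmpeg driven from Python via `subprocess`. This assumes ffmpeg is installed and on PATH; the 128k bitrate is an arbitrary choice, not anything from the thread:

```python
import subprocess

def wav_to_mp3_cmd(src, dst, bitrate="128k"):
    # ffmpeg arguments: -y overwrite the output, -i select the input,
    # -b:a set the audio bitrate; paths are passed through unchanged.
    return ["ffmpeg", "-y", "-i", src, "-b:a", bitrate, dst]

def wav_to_mp3(src, dst, bitrate="128k"):
    # Requires ffmpeg on PATH; raises CalledProcessError on failure.
    subprocess.run(wav_to_mp3_cmd(src, dst, bitrate), check=True)
```

For example, `wav_to_mp3("./audioFiles/test.wav", "./audioFiles/test.mp3")` after synthesizing to wav.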