If I send this small piece of SSML to the speech processor, I get two voices:

<speak version='1.0' xml:lang='es-ES'>
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <p>
        <s>Hola </s>
        <s xml:lang='en'>Hello</s>
        <s>¿Cómo estas?.</s>
    </p>
  </voice>
</speak>

A man speaks the Spanish sentences and a woman speaks the English one. Is this a limitation of the Project Oxford Text to Speech engine? In other words, I would expect the same voice to speak several languages, but that does not seem to be the case.

IgnacioHR
  • Amazon Polly does have the same voice try to pronounce the second language, and in my opinion that is a worse outcome: the voice sounds as if it learned the second language as a foreign language, and it is hard to understand. – user5389726598465 Oct 05 '19 at 08:09
  • Thank you for the comment. The question was asked in 2016 and I think it is obsolete today. Processors today are much better than they were in 2016. – IgnacioHR Oct 07 '19 at 08:01
  • No. I am faced with the same issue today. I finally got Azure Cognitive TTS (formerly Oxford) working in my app today with two different languages, but with different voices. Amazon Polly requires a single language to be specified, and other languages just sound bad when pronounced by a non-native voice. Alexa skills seem to support it, but I'm not sure. Google Cloud does not support two languages. I'm trying to save someone the time I spent researching the options for bilingual apps, not answer your question. – user5389726598465 Oct 07 '19 at 08:04
  • Amazon Polly has truly bilingual voices, but only a few: https://docs.aws.amazon.com/polly/latest/dg/bilingual-voices.html In general, TTS voices are not created multilingual (that is much harder to train). – Rub Jan 07 '23 at 16:07

1 Answer

To quote the SSML spec:

Specifying xml:lang does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor.

While the current fallback behavior leaves something to be desired, the recommendation is to create multiple voice nodes and pick a voice explicitly whenever you switch languages.
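As a rough sketch of what "multiple voice nodes" could look like for the SSML in the question: each language run gets its own voice element, so the processor never has to pick a fallback. The Spanish voice name is taken from the question. The English voice name is an assumption and may not exist on your service, so substitute any English voice it actually offers. This also assumes the engine accepts more than one voice element inside a single speak element.

```xml
<speak version='1.0' xml:lang='es-ES'>
  <!-- Spanish voice, as given in the question -->
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <s>Hola</s>
  </voice>
  <!-- Assumed English male voice name: replace with a voice your service lists -->
  <voice xml:lang='en-US' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (en-US, BenjaminRUS)'>
    <s>Hello</s>
  </voice>
  <!-- Back to the Spanish voice for the rest of the text -->
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <s>¿Cómo estás?</s>
  </voice>
</speak>
```

This does not make one voice speak both languages. It only lets you choose the second voice yourself (for example, another male voice) instead of accepting whatever the processor falls back to.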

cthrash