Mixing languages in the same SSML

Question

If I send this small piece of SSML to the speech processor, I get two voices:

<speak version='1.0' xml:lang='es-ES'>
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <p>
        <s>Hola </s>
        <s xml:lang='en'>Hello</s>
        <s>¿Cómo estas?.</s>
    </p>
  </voice>
</speak>

A man speaks the Spanish and a woman speaks the English. Is this a limitation of the Project Oxford Text to Speech engine? In other words, I would expect the same voice to speak several languages, but it looks like this is not the case.

Amazon Polly does have the same voice try to pronounce the second language, and in my opinion that is a worse outcome: the voice sounds as if it had learned the second language as a foreign language, and it is hard to understand. — user5389726598465, Oct 05 '19 at 08:09
Thank you for the comment. The question was asked in 2016 and I think today it is obsolete. Processors today are much better than in 2016 — IgnacioHR, Oct 07 '19 at 08:01
No. I am faced with the same issue today. I finally got Azure Cognitive TTS (formerly Oxford) working in my app today with two different languages, but with different voices. Amazon Polly requires a single language to be specified, and other languages just sound bad when pronounced with a non-native voice. Alexa skills seem to support it, but I'm not sure. Google Cloud does not support two languages. I'm trying to save someone the time I spent researching the options for bilingual apps, not answer your question. — user5389726598465, Oct 07 '19 at 08:04
Amazon Polly has truly bilingual voices, but they are just a few. https://docs.aws.amazon.com/polly/latest/dg/bilingual-voices.html In general, TTS voices are not created multilingual (much harder to train). — Rub, Jan 07 '23 at 16:07

score 1 · Accepted Answer · answered Oct 04 '16 at 16:06

To quote the SSML spec,

Specifying xml:lang does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor.

While the current fallback behavior leaves something to be desired, the recommendation is to create multiple voice nodes and pick a voice explicitly each time you switch languages.
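
As a rough sketch of that recommendation, the SSML from the question could be split into one voice node per language. The Spanish voice name is the one from the question; the English voice name below is a placeholder (an assumption, not a real voice), to be replaced with an English voice that your speech service actually offers, for example a male one so it sounds closer to Pablo.

```xml
<speak version='1.0' xml:lang='es-ES'>
  <!-- Spanish content: the voice used in the question -->
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <s>Hola</s>
  </voice>
  <!-- English content: the name is a placeholder, substitute a real English voice -->
  <voice xml:lang='en-US' xml:gender='Male' name='YOUR-ENGLISH-VOICE-NAME'>
    <s>Hello</s>
  </voice>
  <!-- Back to the Spanish voice for the rest of the text -->
  <voice xml:lang='es-ES' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (es-ES, Pablo, Apollo)'>
    <s>¿Cómo estás?</s>
  </voice>
</speak>
```

This does not make one voice speak both languages. It only replaces the processor's automatic fallback with a voice you chose yourself.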
