How is Alexa programmed to sing?

Question

If you say "Alexa, sing for me", she will choose one of several songs that have been created with her voice. The voice(s) for each of these songs must have been created somehow.

At first, I thought that SSML would provide the tools necessary to do this, especially the <prosody> tag, which has parameters for pitch and rate (duration).

I thought perhaps each syllable of singing could have its pronunciation specified with <phoneme> and its pitch and duration specified with <prosody>, with <break> tags in between:

<speak>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%" pitch="+50%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
  </prosody>
</speak>

However, when executed, Alexa applies her built-in inflection (to sound like a real human), so the tone is not flat. The "ooh" sounds above, for example, each have a falling tone. (There is also a noticeable break between phonemes, even though "no break" was explicitly specified.)
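For anyone who wants to try the per-syllable approach on a longer melody, the markup can be generated rather than typed by hand. Below is a minimal Python sketch, not a working singing solution: it converts a semitone offset into a percentage pitch change (frequency ratio 2^(n/12)) and wraps each syllable in <prosody> and <phoneme> tags as in the example above. The function names are hypothetical, and the clamp range (-33.3% to +50%) is my assumption about the limits Alexa accepts for prosody pitch. It does nothing about Alexa's built-in inflection, so the falling tone would still be there.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed limits for percentage values in <prosody pitch="...">.
PITCH_MIN_PCT = -33.3
PITCH_MAX_PCT = 50.0


def semitones_to_percent(semitones):
    """Convert a semitone offset into a clamped percentage change in frequency."""
    pct = (2 ** (semitones / 12) - 1) * 100
    return max(PITCH_MIN_PCT, min(PITCH_MAX_PCT, pct))


def note_to_ssml(text, ph, semitones, rate_pct=20):
    """Wrap one syllable in <prosody> and <phoneme> tags (X-SAMPA, as above)."""
    pitch = f"{semitones_to_percent(semitones):+.1f}%"
    return (
        f'<prosody rate="{rate_pct}%" pitch="{pitch}">'
        f'<phoneme alphabet="x-sampa" ph={quoteattr(ph)}>{escape(text)}</phoneme>'
        f"</prosody>"
    )


def melody_to_ssml(notes):
    """Build a <speak> document from (text, x_sampa, semitones) tuples."""
    body = "".join(note_to_ssml(text, ph, st) for text, ph, st in notes)
    return f"<speak>{body}</speak>"
```

Calling melody_to_ssml([("oo", "U", 0), ("oo", "U", 7), ("oo", "U", 0)]) gives roughly the three-note pattern above, with the middle note a fifth higher. Anything beyond about seven semitones up gets clamped, which already shows how narrow the usable range is.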

So then, how did the Alexa voice which is heard singing all of those songs get programmed? Was it via tools currently only available to Amazon developers?

It's also perplexing to me that I am apparently the only person on the internet even asking this question (based on zero results on Stack Overflow, Google, etc.), especially this late in the game. Aren't there loads of musicians out there who would love to be able to make Alexa sing whatever they want?

Edit: Guys, I thought it was common knowledge, but there is no human voice actor behind Alexa. Her voice is completely computer-generated.

Amit Singh · Accepted Answer · 2021-01-01T21:15:13.443

1

Alexa's voice is completely computer-generated, and so are the songs. Research into singing-synthesis models is ongoing (#1 and #2).

Here's a video by Popgun Labs about how they make their AI sing. Although I am unable to find out how Amazon and Google do this, my guess is that it will be something similar.

EDIT: My earlier answer was based on an extension page and drew incorrect conclusions.

edited Jan 01 '21 at 21:15

answered Jan 01 '21 at 15:56

Amit Singh


But considering that Alexa's voice is computer-generated / doesn't come from any particular human, how can these songs be "recorded"? – jdunk Jan 01 '21 at 20:00
@jdunk Yes, it is completely computerised. However, research has not advanced enough for Alexa to sing songs, so those are pre-recorded. It's similar to how you can pay extra to get a celebrity's voice on your Alexa. – Amit Singh Jan 01 '21 at 20:19
It would be impossible to get a human to record every known word and their combination. – Amit Singh Jan 01 '21 at 20:22
Those celebrity voices are 1) from real humans and 2) don't sing, right? Are you saying that a real human is hired, is recorded singing, and then the recording of this real human voice is somehow changed to sound like Alexa's voice? – jdunk Jan 01 '21 at 20:29
My first statement is probably misdirected since it comes from a page run by an extension provided by a third party. Let me try to find more about it and then answer this back. – Amit Singh Jan 01 '21 at 20:55
@jdunk Updated the answer to reflect my latest research. – Amit Singh Jan 01 '21 at 21:15

score 0 · Answer 2 · answered Jan 01 '21 at 11:21

My prediction would be either something really fancy like Natural Language Processing or something along those lines (AI/ML), or they just had the voice actor sing something, or sing particular tones, and cut them together. I don't own an Alexa, but I do have a HomePod mini and an iPhone. Judging by the way they pronounce our local singers' names like "sidhu moosewala" or "amrit maan" (off topic but still related), I believe they just cut and put together words in a "clean" and "flowing" way.

score 0 · Answer 3 · answered Jan 02 '21 at 07:34

Perhaps her voice is simply autotuned.

Certainly, pitch-shifting tools can force any desired pitch from any audio source, and I presume such tools can force duration changes as well.
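To make the pitch/duration point concrete, here is a minimal pure-Python sketch of the crudest form of pitch shifting: resampling. It is only an illustration with hypothetical function names, and I have no evidence that Amazon does anything like this. It shows that naive resampling changes pitch and duration together, which is why autotune-style tools rely on more elaborate methods (a phase vocoder or PSOLA, for example) to control the two independently.

```python
import math


def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` (12 semitones = one octave)."""
    return 2 ** (semitones / 12)


def resample(samples, ratio):
    """Naive pitch shift: read the input `ratio` times faster.

    Uses linear interpolation between neighbouring samples. The pitch is
    multiplied by `ratio`, but the duration is divided by it as well.
    """
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


def estimate_freq(samples, sample_rate):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if a < 0 <= b
    )
    return crossings / (len(samples) / sample_rate)


def sine(freq, sample_rate, seconds):
    """Generate a test tone as a list of floats."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]
```

Shifting a one-second 440 Hz tone up an octave with resample(tone, semitone_ratio(12)) yields a tone near 880 Hz that lasts only half a second, so a separate time-stretching step would be needed to restore the original note length.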

answered Jan 02 '21 at 07:34

jdunk

