Is there a way to make Google Text to Speech, speak text for a desired duration?

Question

I went through the documentation of Google Text to Speech SSML. https://developers.google.com/assistant/actions/reference/ssml#prosody

There is a tag called <prosody> which, per the W3C specification, accepts an attribute called duration: a value in seconds or milliseconds for the desired time to take to read the contained text.

So <speak><prosody duration='6s'>Hello, How are you?</prosody></speak> should take 6 seconds for Google Text to Speech to speak. But when I try it at https://cloud.google.com/text-to-speech/ it does not work, and I get the same result through the REST API.

Does Google Text to Speech ignore the duration attribute? If so, is there another way to achieve the same result?

Note that the W3 specification is 10 years old and probably out of date. The very first paragraph of the Google doc says "Currently the `rate`, `pitch`, and `volume` attributes are supported." — Mr Lister, May 31 '20 at 13:28
@MrLister Thanks, my bad, sorry I missed that line. Any idea how this could be achieved? I am currently experimenting with pydub, but I am running into issues with the pitch of the voice. — SkyTreasure, May 31 '20 at 13:44
No, sorry, I'm not that well versed in SSML, so I can't help you. All I can say is you can't do it in the way you tried because it simply isn't implemented, not because you're doing it wrong! So I'm afraid you'll have to do more research on the web. — Mr Lister, May 31 '20 at 13:49
Thanks @MrLister, will do research and explore other alternatives. — SkyTreasure, May 31 '20 at 17:14

score 1 · Answer 1 · answered Jul 27 '20 at 13:21

There are two ways I know of to solve this:

First option: call Google's API twice. Use the first call to measure the duration of the spoken audio, and the second call to adjust the rate parameter accordingly.
- Pros: Better audio quality? (this is subjective and depends on taste as well as the application's requirements)
- Cons: Doubles the cost and processing time.

Second option: post-process the audio using a specialized tool such as ffmpeg.
- Pros: Cost effective and can be fast if implemented correctly.
- Cons: Some knowledge of the concepts and the usage of an audio post-processing tool is required (no need to become an expert though).
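
As a rough sketch of both options (not code from the original answer): for the first, the new rate can be estimated as the measured duration divided by the target duration, clamped to the [0.25, 4.0] range that Cloud Text-to-Speech documents for speakingRate. For the second, ffmpeg's atempo filter changes tempo without shifting pitch, and chaining several instances keeps each factor within [0.5, 2.0], which older ffmpeg builds require. The function names are hypothetical, and the estimate assumes duration scales inversely with rate, which is only approximately true because pauses may not scale the same way.

```python
# Documented range of Google Cloud TTS audioConfig.speakingRate.
MIN_RATE, MAX_RATE = 0.25, 4.0


def rate_for_target(measured_s: float, target_s: float) -> float:
    """Estimate the speaking rate that moves measured_s toward target_s.

    Assumes duration is roughly inversely proportional to the rate.
    """
    if measured_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    rate = measured_s / target_s
    # Clamp to what the API accepts; outside this range the target is unreachable.
    return max(MIN_RATE, min(MAX_RATE, rate))


def atempo_chain(tempo: float) -> str:
    """Build an ffmpeg audio filter string for an arbitrary tempo factor.

    Splits the factor into steps that each stay within [0.5, 2.0].
    """
    if tempo <= 0:
        raise ValueError("tempo must be positive")
    factors = []
    while tempo > 2.0:
        factors.append(2.0)
        tempo /= 2.0
    while tempo < 0.5:
        factors.append(0.5)
        tempo /= 0.5
    factors.append(tempo)
    return ",".join(f"atempo={f:.4f}" for f in factors)


# First call produced 3 s of audio, but 6 s is wanted: speak at half speed.
print(rate_for_target(3.0, 6.0))
# Same adjustment as an ffmpeg post-processing filter.
print(atempo_chain(0.5))
```

The filter string would then be passed to ffmpeg, for example `ffmpeg -i in.mp3 -filter:a "atempo=0.5000" out.mp3`, where in.mp3 and out.mp3 are placeholder file names.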

score 1 · Answer 2 · answered Dec 26 '21 at 21:15

As Mr Lister already mentioned, the documentation clearly says:

<prosody>

Used to customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3 specifications.

You can test this in the web demo.

In particular you can use things like

rate="slow"

or

rate="80%"

to adjust the speed. However that is as far as you can go with Google TTS.
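
For illustration, here is a minimal sketch of building such an SSML string in Python before sending it to the API. The helper name ssml_with_rate is hypothetical; it only wraps the text in a prosody element and escapes characters that would otherwise break the XML.

```python
from xml.sax.saxutils import escape


def ssml_with_rate(text: str, rate: str) -> str:
    """Wrap text in <speak><prosody rate=...>, escaping &, < and > in the text."""
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'


print(ssml_with_rate("Hello, how are you?", "80%"))
```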

AWS Polly does support what you need, but only on Standard voices (not Neural).

See the Polly documentation page "Setting a Maximum Duration for Synthesized Speech".

Polly also has a UI to do a quick test.
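
As a sketch of what the Polly input could look like: the attribute described on that documentation page is amazon:max-duration on the prosody tag, given in seconds or milliseconds. The helper name below is hypothetical. As I understand it, this sets an upper bound (Polly speeds the speech up to fit) rather than stretching shorter speech to an exact length.

```python
from xml.sax.saxutils import escape


def polly_max_duration_ssml(text: str, seconds: float) -> str:
    """Build Polly SSML that caps the spoken duration of text."""
    ms = int(round(seconds * 1000))
    return (
        f'<speak><prosody amazon:max-duration="{ms}ms">'
        f"{escape(text)}</prosody></speak>"
    )


print(polly_max_duration_ssml("Hello, how are you?", 6))
```

The resulting string would be sent as SSML input (for example with TextType set to ssml when calling synthesize_speech through boto3) using a Standard voice.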
