How to get phonemes from Google Cloud API Text-to-Speech

Question

I am following the Google Cloud API Text-to-Speech Python tutorial. I would like to know if there is a way to return the phonemes and their durations, which are an intermediate step in generating the synthesized speech. Is that possible? If so, could you please point me to the documentation and, ideally, some sample code that does this? I searched and could not find anything that already answers my question.

Thanks! gma

score 2 · Answer 1 · answered May 05 '21 at 08:52

Here are all the steps for using the Google Cloud Text-to-Speech API. The sample code is in Part-3.

[Part-1]

In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
Make sure that billing is enabled for your Cloud project.
Enable the Cloud Text-to-Speech API.
Create a service account:
    a. In the Cloud Console, go to the Create service account page.
    b. Select a project.
    c. In the Service account name field, enter a name. The Cloud Console fills in the Service account ID field based on this name.
    d. Click Done to finish creating the service account. Do not close your browser window. You will use it in the next step.
Create a service account key:
    a. In the Cloud Console, click the email address for the service account that you created.
    b. Click Keys.
    c. Click Add key, then click Create new key.
    d. Click Create. A JSON key file is downloaded to your computer.
    e. Click Close.
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.

Example 1. Linux or macOS

export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Replace KEY_PATH with the path of the JSON file that contains your service account key.

For example:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"

Example 2. Windows

For PowerShell:

$env:GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

Replace KEY_PATH with the path of the JSON file that contains your service account key.

For example:

$env:GOOGLE_APPLICATION_CREDENTIALS="C:\Users\username\Downloads\service-account-file.json"

For Command Prompt:

set GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH

Replace KEY_PATH with the path of the JSON file that contains your service account key.

Install and initialize the Cloud SDK.
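
Before moving on, it can help to confirm that the variable is visible to Python and points at a usable key file. The following is only a minimal sketch: the function name check_credentials is made up for illustration, and it assumes the downloaded key has the usual "type" and "client_email" fields of a service account key.

```python
import json
import os


def check_credentials(environ=os.environ):
    """Return the service account e-mail from the key file, or raise an error."""
    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set in this session")
    if not os.path.isfile(key_path):
        raise FileNotFoundError(f"No key file found at {key_path}")
    with open(key_path, encoding="utf-8") as key_file:
        key = json.load(key_file)
    # Assumption: a service account key file carries these two fields.
    if key.get("type") != "service_account":
        raise ValueError("Key file is not a service account key")
    return key.get("client_email")


# Usage (run in the same shell session where the variable was set):
#     print(check_credentials())
```

If this raises RuntimeError in a new terminal, the variable was most likely set in a different session and needs to be set again.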

[Part-2]

Install the client library

pip install --upgrade google-cloud-texttospeech

[Part-3]

Create audio data

Now you can use Text-to-Speech to create an audio file of synthetic human speech. Use the following code to send a synthesize request to the Text-to-Speech API.

from google.cloud import texttospeech

# The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = texttospeech.TextToSpeechClient()

# The text to be synthesized.
synthesis_input = texttospeech.SynthesisInput(text="Hello, World!")

# Voice selection: language and gender.
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
)

# Format of the returned audio.
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

# Send the synthesize request to the Text-to-Speech API.
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# response.audio_content holds the binary audio data.
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
    print('Audio content written to file "output.mp3"')

If you face any issue, please refer to the link below:

https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries#client-libraries-install-python

score 2 · Answer 2 · answered May 05 '21 at 10:25

Thanks for your reply @Akshansha. I know how to create an audio file of synthetic human speech. My question was more about how to get metadata such as phonemes or visemes. For example, with the Amazon Polly API you can get this kind of data when using Text-to-Speech:

{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":6,"type":"viseme","value":"p"}
{"time":73,"type":"viseme","value":"E"}
{"time":180,"type":"viseme","value":"r"}
{"time":292,"type":"viseme","value":"i"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":373,"type":"viseme","value":"k"}
{"time":460,"type":"viseme","value":"a"}
{"time":521,"type":"viseme","value":"t"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":604,"type":"viseme","value":"@"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":643,"type":"viseme","value":"t"}
{"time":739,"type":"viseme","value":"i"}
{"time":769,"type":"viseme","value":"t"}
{"time":799,"type":"viseme","value":"t"}
{"time":882,"type":"word","start":18,"end":22,"value":"lamb"}
{"time":882,"type":"viseme","value":"t"}
{"time":964,"type":"viseme","value":"a"}
{"time":1082,"type":"viseme","value":"p"}

I was asking whether we can get a similar result with the Google Cloud Text-to-Speech API.
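
To make clear what "duration" means here: the marks above only carry start offsets in milliseconds, so a duration has to be derived from the gap to the next mark. A rough sketch of that step, assuming the output is one JSON object per line as shown (the function name viseme_durations is just illustrative):

```python
import json


def viseme_durations(speech_marks_text):
    """Pair each viseme with the gap (ms) until the next viseme starts."""
    marks = [json.loads(line) for line in speech_marks_text.splitlines() if line.strip()]
    visemes = [m for m in marks if m["type"] == "viseme"]
    durations = []
    # The last viseme has no successor, so it gets no duration here.
    for current, following in zip(visemes, visemes[1:]):
        durations.append((current["value"], following["time"] - current["time"]))
    return durations


sample = "\n".join([
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":6,"type":"viseme","value":"p"}',
    '{"time":73,"type":"viseme","value":"E"}',
    '{"time":180,"type":"viseme","value":"r"}',
])
print(viseme_durations(sample))  # [('p', 67), ('E', 107)]
```

Treating the gap between consecutive visemes as the duration is an approximation; it ignores any pause between words.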

Thanks, gma

From the [request options](https://cloud.google.com/text-to-speech/docs/reference/rpc/google.cloud.texttospeech.v1beta1#timepointtype), it looks like Google has only just started supporting SSML marks (in beta), and doesn't provide phoneme timing like AWS does — Luke, May 05 '21 at 12:59
Thanks for your answer @Luke. It may interest me, but I can't find any examples of how to use it. Do you have any sample code or links that go deeper into this subject? Google's documentation is not very precise and I don't understand how to use it. Thanks — gma, May 06 '21 at 18:15
The [AWS Polly](https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#custom-tag) docs are friendlier, or even [IBM Watson's docs](https://cloud.ibm.com/docs/text-to-speech?topic=text-to-speech-timing#mark) — Luke, May 08 '21 at 00:37
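
For what it's worth, here is a rough sketch of how the SSML mark timepoints from the v1beta1 request options might be used from Python. It is untested against the live service and rests on assumptions: it uses the texttospeech_v1beta1 module of the google-cloud-texttospeech package with the enable_time_pointing request field, and the helper names ssml_with_word_marks and word_timepoints are made up for illustration. It returns timing per mark (here, per word), not per phoneme.

```python
from xml.sax.saxutils import escape


def ssml_with_word_marks(text):
    """Wrap text in <speak> and put a numbered <mark/> before each word."""
    parts = []
    for index, word in enumerate(text.split()):
        parts.append(f'<mark name="w{index}"/>{escape(word)}')
    return "<speak>" + " ".join(parts) + "</speak>"


def word_timepoints(text, language_code="en-US"):
    """Synthesize text and return (mark_name, time_seconds) pairs.

    Needs credentials and network access, so it is not called on import.
    """
    # Imported lazily so the SSML helper works without the client library.
    from google.cloud import texttospeech_v1beta1 as tts

    client = tts.TextToSpeechClient()
    request = tts.SynthesizeSpeechRequest(
        input=tts.SynthesisInput(ssml=ssml_with_word_marks(text)),
        voice=tts.VoiceSelectionParams(language_code=language_code),
        audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
        # Ask the service to report when each <mark/> is reached in the audio.
        enable_time_pointing=[tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK],
    )
    response = client.synthesize_speech(request=request)
    return [(tp.mark_name, tp.time_seconds) for tp in response.timepoints]


print(ssml_with_word_marks("Mary had a lamb"))
```

Calling word_timepoints("Mary had a little lamb.") should then give one entry per mark, which is roughly comparable to the "word" entries in the Polly output, but there is no viseme or phoneme equivalent.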
