How to calculate the duration of each word when using Microsoft text-to-speech?

Question

Hi, I'm using pyttsx3 in Python, which uses the Microsoft SAPI 5.1 SDK text-to-speech synthesizer to generate audio from text. The problem I'm facing is that the speed of the generated speech is not stable: it varies depending on the length of the text, the length of the words, etc. This means the same words are pronounced faster or slower depending on the text they're in. This is a setback for me because I need the timestamp of each word for my program to work properly. So far I have tried different formulas, but none of them are accurate.

Does anyone have an idea how to solve this? (PS: I don't want to use speech analysis to solve this because of reliability issues.)

score 1 · Answer 1 · answered May 11 '22 at 20:24

You need to set up event handlers to get a notification on each word. Apparently pyttsx uses the connect API to set up events:

engine.connect('started-utterance', onStart)
engine.connect('started-word', onWord)
engine.connect('finished-utterance', onEnd)

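As far as I can tell from the pyttsx3 docs, the onWord callback receives name, location and length (the word's character offset and length within the text), not a duration, so to get timings you would record a timestamp each time the callback fires.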
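
A minimal sketch of timing words from these events, assuming the callback signatures given in the pyttsx3 documentation: onStart(name), onWord(name, location, length) and onEnd(name, completed). The WordTimer class and the speak_and_time helper are hypothetical names made up for this example. The timestamps reflect when each callback fires, which may only approximate the audio timing depending on the driver.

```python
import time


class WordTimer:
    """Records a timestamp for every 'started-word' event (hypothetical helper)."""

    def __init__(self, text, clock=time.perf_counter):
        self.text = text
        self.clock = clock      # injectable so the class can be tested without audio
        self.start = None       # clock value when the utterance started
        self.end = None         # seconds from utterance start to utterance end
        self.events = []        # (seconds since start, location, length)

    def on_start(self, name):
        self.start = self.clock()

    def on_word(self, name, location, length):
        now = self.clock()
        if self.start is None:  # fall back if 'started-utterance' never fired
            self.start = now
        self.events.append((now - self.start, location, length))

    def on_end(self, name, completed):
        now = self.clock()
        if self.start is None:
            self.start = now
        self.end = now - self.start

    def durations(self):
        """Return (word, start_seconds, duration_seconds) for each word event.

        A word's duration is taken as the gap until the next word event, or
        until the end of the utterance for the last word (None if unknown).
        """
        rows = []
        for i, (t, location, length) in enumerate(self.events):
            if i + 1 < len(self.events):
                nxt = self.events[i + 1][0]
            else:
                nxt = self.end
            duration = None if nxt is None else nxt - t
            rows.append((self.text[location:location + length], t, duration))
        return rows


def speak_and_time(text):
    """Speak text with pyttsx3 and return the per-word timings."""
    import pyttsx3  # imported here so WordTimer is usable without pyttsx3

    engine = pyttsx3.init()
    timer = WordTimer(text)
    engine.connect('started-utterance', timer.on_start)
    engine.connect('started-word', timer.on_word)
    engine.connect('finished-utterance', timer.on_end)
    engine.say(text)
    engine.runAndWait()
    return timer.durations()
```

Calling speak_and_time("some text") should then give one (word, start, duration) tuple per word event. Note that the measured gap between two word events also includes any pause between the words, so it is an upper bound on the spoken duration of the word itself.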
