Hi, I'm using pyttsx3 in Python, which uses the Microsoft SAPI 5.1 text-to-speech synthesizer to generate audio from text. The problem I'm facing is that the speed of the generated speech is not stable: it varies with the length of the text, the length of the words, and so on. This means the same word is pronounced faster or slower depending on the text it appears in. This is a setback for me because my program needs a timestamp for each word to work properly. So far I have tried different formulas, but none of them is accurate.
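To make the problem concrete, here is a minimal sketch of the kind of formula that breaks down. It is only an illustration, not the exact code in use: the function name `estimate_word_timestamps` is hypothetical, and it assumes a constant speaking rate in words per minute (pyttsx3's `rate` property defaults to 200) with each word's duration proportional to its character count.

```python
def estimate_word_timestamps(text, words_per_minute=200):
    """Naive estimate: return (word, start_seconds, end_seconds) per word.

    Assumption (hypothetical, for illustration): the engine speaks at a
    constant rate, and each word takes time proportional to its length.
    """
    words = text.split()
    if not words:
        return []

    # Total duration implied by a constant words-per-minute rate.
    total_seconds = len(words) * 60.0 / words_per_minute
    total_chars = sum(len(word) for word in words)

    timestamps = []
    start = 0.0
    for word in words:
        # Share of the total duration proportional to the word's length.
        duration = total_seconds * len(word) / total_chars
        timestamps.append((word, start, start + duration))
        start += duration
    return timestamps


if __name__ == "__main__":
    for word, start, end in estimate_word_timestamps("the quick brown fox"):
        print(f"{word}: {start:.2f}s to {end:.2f}s")
```

Because the real synthesizer changes its pace with the surrounding text, an estimate like this drifts away from the actual audio, which is exactly the inaccuracy described above.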
Does anyone have an idea how to solve this? (P.S. I don't want to use speech analysis for this, because of reliability issues.)