I am building an app with React that uses Microsoft Speech to handle Text to Speech (TTS) tasks.
The app fetches the response from ChatGPT as a stream and feeds each complete sentence into the TTS queue. A text box displays all the tokens received so far. Because the tokens must first form a complete sentence, and that sentence then takes time to convert into speech, there is a significant delay between the text being displayed and the speech being played.
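For context, the sentence-queueing step looks roughly like the sketch below. This is a simplified illustration, not my exact code: `SentenceBuffer` is a hypothetical helper name, and it assumes a sentence ends at `.`, `!` or `?` followed by whitespace.

```typescript
// Hypothetical helper: accumulates streamed tokens and emits complete sentences.
class SentenceBuffer {
  private pending = "";

  // Append one streamed token; return any sentences that are now complete.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished remainder for the next token.
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the stream ends to release any trailing text.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

Each sentence returned by `push` goes into the TTS queue, while the text box is updated on every token. That is why the text runs ahead of the audio.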
I want to display the text word by word, in sync with Microsoft Speech. Does Microsoft TTS provide the timestamps at which each word is spoken? For example, something like this: input - "How are you?", output - [{word: "How", timestamp: 0}, {word: "are", timestamp: 0.5}, {word: "you?", timestamp: 0.9}]. Alternatively, is there an event that fires when each word is spoken?
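To make the goal concrete, here is a minimal sketch of how I would use such timestamps if they were available. The `WordTiming` shape and the `visibleText` function are my own hypothetical names, and the timestamps are assumed to be seconds from the start of the audio.

```typescript
// Hypothetical shape of the data I am hoping to get from the TTS service.
interface WordTiming {
  word: string;
  timestamp: number; // assumed: seconds from the start of the audio
}

// Return the text that should be visible once `elapsedSeconds` of audio has played.
function visibleText(timings: WordTiming[], elapsedSeconds: number): string {
  return timings
    .filter((t) => t.timestamp <= elapsedSeconds)
    .map((t) => t.word)
    .join(" ");
}

const timings: WordTiming[] = [
  { word: "How", timestamp: 0 },
  { word: "are", timestamp: 0.5 },
  { word: "you?", timestamp: 0.9 },
];
```

In the React component I would call `visibleText(timings, elapsed)` as playback progresses and render the result in the text box.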