Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service?

Question

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when the speaker changes in a longer file. I know that Descript must do something like this under the hood, which I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order that they occurred.

I know I could probably extract timestamps for each word and then iterate through the results, getting the timestamps for when a previous result is a different speaker than the current result. This seems very tedious for a long audio file and I'm not sure how accurate this is.
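For reference, the iteration described above can be a single linear pass. This is only a minimal sketch: it assumes the per-word speaker tag and start/end times have already been pulled out of the diarization response into plain records. The `Word` and `Segment` types and both function names are hypothetical, not part of the Google client library, and the result is only as accurate as the speaker tags the service assigns.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    # Hypothetical record holding values copied from the API response.
    speaker_tag: int
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Segment:
    speaker_tag: int
    start: float
    end: float


def speaker_segments(words: Iterable[Word]) -> List[Segment]:
    """Merge consecutive words by the same speaker into one segment."""
    segments: List[Segment] = []
    for w in words:
        if segments and segments[-1].speaker_tag == w.speaker_tag:
            # Same speaker as the previous word: extend the current segment.
            segments[-1].end = w.end
        else:
            # Speaker changed (or first word): open a new segment.
            segments.append(Segment(w.speaker_tag, w.start, w.end))
    return segments


def switch_times(segments: List[Segment]) -> List[float]:
    """Start time of every segment after the first, i.e. each speaker change."""
    return [s.start for s in segments[1:]]
```

Each `Segment` then gives the start and end needed to cut one clip, and the segments come out in the order they occurred.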

I know someone else solved it (sort of) in this way many years ago but they had to do it by parsing out the timestamps of each word - I'm hoping to avoid that. https://stackoverflow.com/questions/50900340/speech-to-text-map-speaker-label-to-corresponding-transcript-in-json-response — jeanmw, Oct 16 '22 at 00:46

score 0 · Answer 1 · answered Oct 28 '22 at 16:57

Google's Speech-to-Text phone model does what you are looking for: it gives result end times for each identified speaker. See https://cloud.google.com/speech-to-text/docs/phone-model for details.

mohsyn


