
I know there is a way to get delineated words by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when a speaker changes in a longer file. I know that Descript must do something like this under the hood, which I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order that they occurred.

I know I could probably extract timestamps for each word and then iterate through the results, getting the timestamps wherever the previous word belongs to a different speaker than the current one. That seems very tedious for a long audio file, and I'm not sure how accurate it would be.
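For reference, here is a rough sketch of that word-by-word approach using the google-cloud-speech Python client, so it's clear what I'm hoping to avoid. The bucket URI, encoding, and speaker counts are placeholders, not values from my actual setup:

    from google.cloud import speech

    client = speech.SpeechClient()

    # Placeholder config: adjust encoding, language and speaker counts to the audio.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        language_code="en-US",
        enable_word_time_offsets=True,
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
    audio = speech.RecognitionAudio(uri="gs://your-bucket/your-audio.wav")  # placeholder URI

    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=3600)

    # With diarization enabled, the last result carries every word with a speaker_tag.
    words = response.results[-1].alternatives[0].words

    # Walk the words and start a new segment whenever the speaker_tag changes.
    segments = []  # (speaker_tag, start_seconds, end_seconds)
    current = None
    for w in words:
        start = w.start_time.total_seconds()  # timedeltas in recent client versions
        end = w.end_time.total_seconds()
        if current is None or w.speaker_tag != current[0]:
            if current is not None:
                segments.append(tuple(current))
            current = [w.speaker_tag, start, end]
        else:
            current[2] = end
    if current is not None:
        segments.append(tuple(current))

    for tag, start, end in segments:
        print(f"speaker {tag}: {start:.2f}s to {end:.2f}s")

Those (start, end) pairs could then be used to cut the clips with something like ffmpeg or pydub.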

jeanmw
  • I know someone else solved it (sort of) in this way many years ago but they had to do it by parsing out the timestamps of each word - I'm hoping to avoid that. https://stackoverflow.com/questions/50900340/speech-to-text-map-speaker-label-to-corresponding-transcript-in-json-response – jeanmw Oct 16 '22 at 00:46

1 Answer


Google "Speech to text" - phone model does what you are looking at by giving result end times for each identified speaker. Check more here https://cloud.google.com/speech-to-text/docs/phone-model

mohsyn