I know the Google Cloud Speech-to-Text API can label individual words by speaker. I'm looking for a way to get the timestamps at which the speaker changes in a longer file. I assume Descript does something like this under the hood, and that is what I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order in which they occurred.
I know I could probably extract the timestamps for each word and then iterate through the results, recording a timestamp whenever the current word's speaker differs from the previous word's. This seems very tedious for a long audio file, and I'm not sure how accurate it would be.
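
For reference, here is a minimal sketch of the iteration I have in mind. It assumes the per-word results have already been flattened into `(speaker, start, end)` tuples with times in seconds (in the Google response I believe these correspond to each word's `speaker_tag`, `start_time`, and `end_time` when diarization is enabled). The `Word` and `Segment` types, the `speaker_segments` function, and the sample data are all hypothetical names and values of my own, not part of the API.

```python
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    """One recognized word (hypothetical container, times in seconds)."""
    speaker: int
    start: float
    end: float


class Segment(NamedTuple):
    """A contiguous run of words from a single speaker."""
    speaker: int
    start: float
    end: float


def speaker_segments(words: Iterable[Word]) -> List[Segment]:
    """Merge consecutive words by the same speaker into segments, in order."""
    segments: List[Segment] = []
    for w in words:
        if segments and segments[-1].speaker == w.speaker:
            # Same speaker as the previous word: extend the current segment.
            segments[-1] = segments[-1]._replace(end=w.end)
        else:
            # Speaker changed (or first word): start a new segment.
            segments.append(Segment(w.speaker, w.start, w.end))
    return segments


# Made-up sample data standing in for the flattened API output.
words = [
    Word(1, 0.0, 0.4),
    Word(1, 0.4, 0.9),
    Word(2, 1.2, 1.6),
    Word(2, 1.6, 2.0),
    Word(1, 2.3, 2.8),
]

segments = speaker_segments(words)

# Every segment after the first begins at a speaker change.
change_times = [s.start for s in segments[1:]]
```

This is a single pass over the words, so the cost grows linearly with the length of the transcript, and each resulting segment gives the start and end time needed to cut one clip. It does nothing about my accuracy concern: the change points are only as good as the per-word speaker labels.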