Google/YouTube automatic speech recognition generates subtitles without marking up the different voices in the audio.
With a lecture there is a single voice, but when people are having a conversation, or more than one person is serving as a talking head, the STT software could mark up the change of speaker: it should already be able to detect the different tones and timbres of the voices as part of spectrally extracting the phonemes from the audio. That information would help split each person's input into separate sentences and paragraphs.
Notice that I don't need to identify a particular speaker/person; I just need to distinguish the different "voices" participating in the conversation.
I have taken a look at what seems to be a Java wrapper around whatever STT they use (google.cloud.speech.v1), but I don't see such functionality, even though I think it should be possible.
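To make concrete what I'm after, something like the sketch below is what I was hoping to find in that wrapper. The diarization config, the setDiarizationConfig setter, and the per-word getSpeakerTag accessor are my own guesses at what such an API could look like (I did not find them in the wrapper I looked at), and the gs:// path is made up, so treat this as an illustration of the idea rather than working code against the real library:

```java
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.RecognizeResponse;
import com.google.cloud.speech.v1.SpeakerDiarizationConfig;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.cloud.speech.v1.WordInfo;

public class VoiceChangeSketch {
    public static void main(String[] args) throws Exception {
        try (SpeechClient speech = SpeechClient.create()) {
            // Ask the recognizer to attach an anonymous voice tag (1, 2, 3, ...)
            // to each word -- no identification of *who* is speaking, only that
            // the voice changed. These classes/methods are assumed, not confirmed.
            SpeakerDiarizationConfig diarization = SpeakerDiarizationConfig.newBuilder()
                    .setEnableSpeakerDiarization(true)
                    .setMinSpeakerCount(2)
                    .setMaxSpeakerCount(6)
                    .build();

            RecognitionConfig config = RecognitionConfig.newBuilder()
                    .setEncoding(AudioEncoding.LINEAR16)
                    .setSampleRateHertz(16000)
                    .setLanguageCode("en-US")
                    .setDiarizationConfig(diarization)
                    .build();

            RecognitionAudio audio = RecognitionAudio.newBuilder()
                    .setUri("gs://my-bucket/conversation.wav") // hypothetical file
                    .build();

            RecognizeResponse response = speech.recognize(config, audio);

            // Walk the recognized words and start a new paragraph whenever the
            // voice tag changes -- exactly the splitting described above.
            SpeechRecognitionResult last = response.getResults(response.getResultsCount() - 1);
            int currentVoice = -1;
            StringBuilder paragraph = new StringBuilder();
            for (WordInfo word : last.getAlternatives(0).getWordsList()) {
                if (word.getSpeakerTag() != currentVoice) {
                    if (paragraph.length() > 0) {
                        System.out.println(paragraph);
                    }
                    currentVoice = word.getSpeakerTag();
                    paragraph = new StringBuilder("[Voice " + currentVoice + "] ");
                }
                paragraph.append(word.getWord()).append(' ');
            }
            if (paragraph.length() > 0) {
                System.out.println(paragraph);
            }
        }
    }
}
```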
Any ideas why they don't do this, how it could be done, or which STT software they use and whether it could be configured to do it?