
Google/YouTube automatic speech recognition generates subtitles without marking up the different voices.

In a lecture there is a single voice, but when people are having a conversation, or more than one person serves as a talking head, the STT software could mark up the change of voice. It should already be able to detect the different tones and timbres of the voices as part of spectrally extracting the phonemes from the audio, and this would help split each person's input into separate sentences and paragraphs.

Note that I don't need to identify a particular speaker/person; I just need to notice the different "voices" participating in a conversation.

I have taken a look at what seems to be a Java wrapper around whatever STT engine they use (google.cloud.speech.v1), but I don't see such functionality, even though I think it should be possible.

Any ideas why they don't do that, how it could be done, or which STT software they use and whether it could be configured to do this?

Cris Luengo

1 Answer


You can use speaker diarization: Speech-to-Text can recognize multiple voices in the same audio clip. When you send an audio transcription request, include the enableSpeakerDiarization and diarizationSpeakerCount parameters via the SpeakerDiarizationConfig in the request. Set enableSpeakerDiarization to True and, to improve your transcription results, specify the number of speakers in the audio clip with diarizationSpeakerCount; Speech-to-Text uses a default value if you do not provide one.

For example, in Python:

from google.cloud import speech

# Configure speaker diarization: let the API distinguish between voices,
# giving it a range for how many speakers to expect in the clip.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=10,
)

You can see a complete code example in the Speech-to-Text documentation.
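As a rough sketch of how the pieces fit together, the diarization config can be passed into a RecognitionConfig and the per-word speaker tags read back from the response. The file name, audio encoding, and sample rate below are assumptions for illustration:

from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical input file; any mono LINEAR16 WAV with two or more speakers works.
with open("conversation.wav", "rb") as audio_file:
    audio = speech.RecognitionAudio(content=audio_file.read())

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=10,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,  # assumed sample rate of the input file
    language_code="en-US",
    diarization_config=diarization_config,
)

response = client.recognize(config=config, audio=audio)

# With diarization enabled, the words attached to the last result carry a
# speaker_tag, so consecutive words can be grouped into per-voice segments.
words = response.results[-1].alternatives[0].words
for word_info in words:
    print(f"speaker {word_info.speaker_tag}: {word_info.word}")

The speaker_tag values are just anonymous voice labels (1, 2, ...), which matches what the question asks for: telling voices apart without identifying who each person is.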

Raul Saucedo